Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu; Xiaohan Wang; Haipeng Luo; Jingdong Wang; Yi Yang; Wanli Ouyang

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE .
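The abstract describes Temporal Concept Spotting only at a high level: text-side knowledge is used to weight frames by temporal saliency without adding learned parameters. One plausible reading, offered here as a minimal sketch and not as the authors' implementation, is to score each frame embedding against a text embedding, apply a softmax over frames, and use the resulting weights to pool the frames into one video embedding. The function names, the toy vectors, and the `temperature` value are all assumptions made for illustration; see the linked repository for the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_saliency(frame_embs, text_emb, temperature=0.1):
    """Softmax over per-frame similarity to the text embedding.

    No weights are learned here; `temperature` is a fixed,
    hypothetical hyperparameter chosen for this illustration.
    """
    sims = [cosine(f, text_emb) / temperature for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def pool_frames(frame_embs, weights):
    """Saliency-weighted sum of frame embeddings."""
    dim = len(frame_embs[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_embs))
        for d in range(dim)
    ]


# Toy data (assumed): three 2-D "frame embeddings" and one "text embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [0.0, 1.0]

weights = temporal_saliency(frames, text)
video_emb = pool_frames(frames, weights)
```

In this toy setup the second frame, which points in the same direction as the text vector, receives the largest weight, so the pooled video embedding leans toward it. With real CLIP features the same pattern would apply to higher-dimensional embeddings.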

updated: Sat Mar 25 2023 12:12:30 GMT+0000 (UTC)

published: Sat Dec 31 2022 11:36:53 GMT+0000 (UTC)

arXiv
