Video Question Answering Using CLIP-Guided Visual-Text Attention

Shuhong Ye; Weikai Kong; Chenglin Yao; Jianfeng Ren; Xudong Jiang

Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using TimeSformer and text features using BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose cross-domain learning to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
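The abstract does not specify the exact form of the attention between target-domain and general-domain features. As a rough illustration only, the sketch below assumes a generic scaled dot-product cross-attention in which target-domain video features act as queries and general-domain CLIP text features act as keys and values. The function names, feature dimensions, and toy values are hypothetical and are not taken from the paper; in practice the features would come from TimeSformer, BERT, and CLIP encoders.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    ASSUMPTION: this is a textbook attention form used for illustration,
    not the paper's confirmed formulation.
    queries: list of d-dimensional vectors (e.g. target-domain features)
    keys, values: lists of vectors from the other domain or modality
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        out.append(attended)
    return out


# Hypothetical toy features standing in for encoder outputs.
video_feats = [  # stand-in for target-domain video features
    [1.0, 0.0, 0.5, 0.2],
    [0.1, 0.9, 0.3, 0.4],
    [0.5, 0.5, 0.5, 0.5],
]
clip_text_feats = [  # stand-in for general-domain CLIP text features
    [0.9, 0.1, 0.4, 0.1],
    [0.0, 1.0, 0.2, 0.5],
]

# Each video feature is re-expressed as a mixture of CLIP text features.
guided = cross_attention(video_feats, clip_text_feats, clip_text_feats)
```

Here `guided` has one vector per video feature, each a convex combination of the CLIP text features. A full model would presumably apply learned projections before the attention and fuse several such attended feature sets before answer prediction, but those details are not given in the abstract.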

updated: Mon Mar 06 2023 13:49:15 GMT+0000 (UTC)

published: Mon Mar 06 2023 13:49:15 GMT+0000 (UTC)

arXiv
