Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

Guangyi Chen; Xiao Liu; Guangrun Wang; Kun Zhang; Philip H. S. Torr; Xiao-Ping Zhang; Yansong Tang

Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However, due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones. This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between the image and video domains. To bridge these gaps, we propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics through a visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional methods for adapting pretrained knowledge, which concentrate only on the downstream task objective, the Temporal Aligner introduces an extra language-guided autoregressive task to facilitate the learning of temporal dependencies. Its objective is to predict future states based on historical clues and language guidance that describes event progression. In addition, to reduce the semantic gap and adapt the textual representation for better event description, we introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement. We evaluate Tem-Adapter and different pretraining transfer methods on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method.
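The template step of the Semantic Aligner can be illustrated with a minimal sketch. The abstract does not give the actual template, so the wording used here ("Question: ... Answer: ...") and the function names `fuse_qa` and `build_candidates` are hypothetical, chosen only to show the idea of turning each question and candidate answer into one text string that an image-text encoder could treat as an event description.

```python
def fuse_qa(question: str, answer: str) -> str:
    """Fuse a question and one candidate answer into a single description.

    The template below is an assumption for illustration; the paper's
    actual template is not stated in the abstract.
    """
    q = question.strip()
    a = answer.strip().rstrip(".")  # avoid a doubled period at the end
    return f"Question: {q} Answer: {a}."


def build_candidates(question: str, answers: list[str]) -> list[str]:
    """Build one fused description per candidate answer."""
    return [fuse_qa(question, a) for a in answers]


candidates = build_candidates(
    "What happens after the car turns left?",
    ["It stops", "It speeds up."],
)
for text in candidates:
    print(text)
```

In a multiple-choice VideoQA setting, each fused string would then be encoded and compared against the video representation. That scoring step, and the Transformer decoder refinement described above, are not shown here.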

updated: Wed Aug 16 2023 15:00:50 GMT+0000 (UTC)

published: Wed Aug 16 2023 15:00:50 GMT+0000 (UTC)

arXiv
