Prompting Visual-Language Models for Efficient Video Understanding

Chen Ju; Tengda Han; Kunhao Zheng; Ya Zhang; Weidi Xie

Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline for efficiently adapting a pre-trained I-VL model, exploiting its powerful ability for resource-hungry video understanding tasks with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance compared with existing methods, despite optimising significantly fewer parameters.
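
The prompting idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sizes `DIM` and `N_PROMPT`, the function names, and both encoders are hypothetical stand-ins. In particular, the frozen text encoder is replaced by mean pooling, and the lightweight temporal Transformer is replaced by a simple average over frames, purely to show where the learnable prompt vectors sit and how a video is matched to class names by similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8        # embedding width (hypothetical; real models are far wider)
N_PROMPT = 4   # learnable prompt vectors on each side of the class name (assumed)

# Learnable parameters: in this sketch, the only arrays a training loop would update.
prefix = rng.normal(scale=0.01, size=(N_PROMPT, DIM))
suffix = rng.normal(scale=0.01, size=(N_PROMPT, DIM))


def build_prompt(class_tokens):
    """Wrap frozen class-name token embeddings with the learnable prompt vectors."""
    return np.concatenate([prefix, class_tokens, suffix], axis=0)


def frozen_text_encoder(tokens):
    """Stand-in for a frozen text encoder: mean-pool, then L2-normalise."""
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def video_embedding(frame_features):
    """Stand-in for temporal modelling: average frames, then L2-normalise.

    The abstract describes a lightweight Transformer at this step;
    mean pooling is only a placeholder.
    """
    pooled = frame_features.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def classify(frame_features, class_token_list):
    """Return the best-matching class index and all similarity scores."""
    v = video_embedding(frame_features)
    text = np.stack([frozen_text_encoder(build_prompt(t)) for t in class_token_list])
    scores = text @ v  # cosine similarities, since both sides are unit-norm
    return int(np.argmax(scores)), scores
```

In this sketch, calling `classify(frames, tokens)` with `frames` of shape `(num_frames, DIM)` and `tokens` as a list of per-class arrays of shape `(num_tokens, DIM)` returns the predicted class index and the per-class scores. Under the approach the abstract describes, only the small set of prompt vectors (and the temporal module) would be trained, while the pre-trained encoders stay fixed.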

updated: Fri Jul 15 2022 08:31:45 GMT+0000 (UTC)

published: Wed Dec 08 2021 18:58:16 GMT+0000 (UTC)

arXiv

