Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan; Tao Zhu; Zirui Wang; Yuan Cao; Mi Zhang; Soham Ghosh; Yonghui Wu; Jiahui Yu

This work explores an efficient approach to establishing a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layers or a perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame, for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into one long sequence that serves as a frozen video representation, and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including those of the pooling layers, are loaded directly from a pretrained image-text CoCa model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as on zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, YouCook2). Our approach establishes a simple and effective video-text baseline for future research.
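The flatten-then-pool step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a simplified single-head attention pool without learned projections, random arrays stand in for the frozen encoder outputs and the pretrained query vectors, and the names `flatten_frames` and `attentional_pool` as well as all sizes are hypothetical.

```python
import numpy as np


def flatten_frames(frame_tokens):
    """Reshape per-frame tokens (T, N, D) into one sequence (T * N, D)."""
    T, N, D = frame_tokens.shape
    return frame_tokens.reshape(T * N, D)


def attentional_pool(tokens, queries):
    """Simplified single-head cross-attention pooling (an assumption,
    not CoCa's exact layer): (L, D) tokens, (Q, D) queries -> (Q, D)."""
    D = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(D)              # (Q, L)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                               # (Q, D)


rng = np.random.default_rng(0)
T, N, D = 8, 16, 32  # illustrative sizes: frames, tokens per frame, width

# Stand-in for the frozen image encoder applied to each of the T frames.
frame_tokens = rng.normal(size=(T, N, D))
video_tokens = flatten_frames(frame_tokens)               # (T * N, D)

# Stand-ins for the pretrained pooler queries (counts are illustrative).
contrastive_query = rng.normal(size=(1, D))
generative_queries = rng.normal(size=(4, D))

video_embedding = attentional_pool(video_tokens, contrastive_query)
caption_context = attentional_pool(video_tokens, generative_queries)
```

Because the pooler attends over however many tokens it receives, the same query vectors apply unchanged to a single image (\(N\) tokens) or a video (\(N \times T\) tokens), which is consistent with the abstract's claim that the image-text pooling layers can be reused without architectural changes.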

updated: Fri Dec 09 2022 16:39:09 GMT+0000 (UTC)

published: Fri Dec 09 2022 16:39:09 GMT+0000 (UTC)

arXiv
