VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan; Tao Zhu; Zirui Wang; Yuan Cao; Mi Zhang; Soham Ghosh; Yonghui Wu; Jiahui Yu

We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
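
The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned key/value projections, and uses hypothetical toy sizes (`T`, `N`, `D`) and random stand-ins for the learned queries. It only shows that a pooler taking a token sequence is indifferent to whether the tokens come from one image or from several frames concatenated together.

```python
import math
import random


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists [T][N][D] into one sequence [T*N][D]."""
    return [tok for frame in frame_tokens for tok in frame]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(queries, tokens):
    """Single-head cross-attention pooling (assumption: keys = values = tokens).

    Each query attends over every token and returns a weighted average,
    so the output has one vector per query regardless of sequence length.
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)])
    return pooled


# Hypothetical toy sizes: T frames, N tokens per frame, D channels.
random.seed(0)
T, N, D = 4, 3, 8
video = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)]

flat = flatten_frames(video)  # T*N tokens, treated like one long image sequence

# Random stand-ins for learned queries: one for the contrastive pooler,
# several for the generative pooler (the counts here are illustrative only).
contrastive_query = [[random.gauss(0, 1) for _ in range(D)]]
generative_queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]

contrastive = attentional_pool(contrastive_query, flat)   # 1 x D video embedding
generative = attentional_pool(generative_queries, flat)   # 5 x D tokens for a decoder
```

Because the pooled output size depends only on the number of queries, the same pooling function handles a single frame or many flattened frames without any added cross-frame fusion module, which is the property the abstract relies on.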

updated: Wed Mar 15 2023 06:48:23 GMT+0000 (UTC)

published: Fri Dec 09 2022 16:39:09 GMT+0000 (UTC)

arXiv
