Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Xudong Lin; Simran Tiwari; Shiyuan Huang; Manling Li; Mike Zheng Shou; Heng Ji; Shih-Fu Chang

Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and text contrastive models such as SimCSE have recently been studied extensively for their strong ability to produce discriminative sentence embeddings. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate representing videos using either continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of either a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. Surprisingly, we find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text examples. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. Together, these empirical analyses establish a solid foundation for future research on affordable and upgradable multimodal intelligence.
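The best-performing combination described in the abstract (videos represented as discrete text tokens, fused with the question by a text model) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the hand-written 3-dimensional embeddings stand in for a contrastive image-text model such as CLIP, the bag-of-words `toy_text_encoder` stands in for a pretrained contrastive text model such as SimCSE, and every function name, the vocabulary, and the `top_k` parameter are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def video_to_tokens(frame_embs, vocab_embs, top_k=2):
    """Represent a video as discrete text tokens.

    Each vocabulary word is scored by its best cosine match over all frames;
    the top_k words are kept. The embeddings are assumed to come from a
    shared image-text space (toy values here, not a real model).
    """
    scores = {
        word: max(cosine(frame, emb) for frame in frame_embs)
        for word, emb in vocab_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def toy_text_encoder(text):
    """Hypothetical stand-in for a contrastive text model: word counts."""
    return Counter(text.lower().split())


def sparse_cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(video_tokens, question, candidates):
    """Fuse video tokens and question as plain text, then rank candidates."""
    query = toy_text_encoder(" ".join(video_tokens) + " " + question)
    return max(candidates, key=lambda c: sparse_cosine(query, toy_text_encoder(c)))


# Toy data (assumed for illustration only).
VOCAB = {"dog": [1.0, 0.0, 0.0], "frisbee": [0.0, 1.0, 0.0], "kitchen": [0.0, 0.0, 1.0]}
FRAMES = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
CANDIDATES = ["a frisbee", "a pan", "a ball"]

tokens = video_to_tokens(FRAMES, VOCAB)
answer = retrieve(tokens, "what is the animal catching", CANDIDATES)
print(tokens, answer)
```

Because the video is converted to words before fusion, the text encoder can in principle be swapped for a stronger one without retraining a multimodal fusion module, which is one reading of the "upgradable" property the abstract mentions.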

updated: Tue Apr 11 2023 02:29:20 GMT+0000 (UTC)

published: Sun Jun 05 2022 01:43:52 GMT+0000 (UTC)

arXiv
