CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Huaishao Luo; Lei Ji; Ming Zhong; Yang Chen; Wen Lei; Nan Duan; Tianrui Li

Video-text retrieval plays an essential role in multi-modal research and is widely used in real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose CLIP4Clip, a model that transfers the knowledge of CLIP to video-language retrieval in an end-to-end manner. We investigate several questions through empirical studies: 1) Are image features enough for video-text retrieval? 2) How does post-pretraining CLIP on a large-scale video-text dataset affect performance? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experiments show that CLIP4Clip, transferred from CLIP, achieves state-of-the-art (SOTA) results on various video-text retrieval datasets, including MSR-VTT, MSVD, and LSMDC.
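To make question 1 concrete, here is a minimal sketch of the simplest way to score a video against a text query using only per-frame image features: average the frame embeddings and take the cosine similarity with the text embedding. It assumes the embeddings have already been produced by some image and text encoder (for example, CLIP's); the function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration. This is not the authors' implementation, and it deliberately ignores frame order, which is what question 3 is about.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video-level embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]


def video_text_similarity(frame_embeddings, text_embedding):
    """Cosine similarity between a mean-pooled video and a text embedding."""
    frames = [l2_normalize(f) for f in frame_embeddings]
    video = l2_normalize(mean_pool(frames))
    text = l2_normalize(text_embedding)
    return sum(v * t for v, t in zip(video, text))


def rank_videos(videos, text_embedding):
    """Return video names sorted from most to least similar to the query."""
    scores = {
        name: video_text_similarity(frames, text_embedding)
        for name, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: two videos with two frames each (assumed, not real CLIP outputs).
videos = {
    "cooking": [[1.0, 0.0], [0.8, 0.6]],
    "soccer": [[0.0, 1.0], [0.6, 0.8]],
}
query = [1.0, 0.1]
ranking = rank_videos(videos, query)
```

Because mean pooling is order-invariant, shuffling the frames of a video leaves its score unchanged; any mechanism that models temporal dependency would have to replace `mean_pool` with something order-aware.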

updated: Sun Apr 18 2021 13:59:50 GMT+0000 (UTC)

published: Sun Apr 18 2021 13:59:50 GMT+0000 (UTC)

arXiv
