CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Hongwei Xue; Yuchong Sun; Bei Liu; Jianlong Fu; Ruihua Song; Houqiang Li; Jiebo Luo

Pre-trained image-text models such as CLIP have demonstrated the strength of vision-language representations learned from large-scale web-collected image-text data. Building on these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still underexplored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a great impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

updated: Wed Sep 14 2022 05:47:02 GMT+0000 (UTC)

published: Wed Sep 14 2022 05:47:02 GMT+0000 (UTC)

arXiv

