A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Zhixiong Zeng; Wenji Mao

Cross-Modal Retrieval (CMR) is an important research topic spanning multimodal computing and information retrieval. It takes one type of data as the query to retrieve relevant data of another type, and it is widely used in real-world applications. Recently, vision-language pre-trained models, represented by CLIP, have demonstrated their superiority in learning visual and textual representations and have achieved impressive performance on various vision- and language-related tasks. Although CLIP and earlier pre-trained models have shown large performance improvements in unsupervised CMR, their performance and impact on supervised CMR have rarely been explored due to the lack of multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model, conduct a comprehensive empirical study, and provide insights into its performance and impact on supervised CMR. To this end, we first propose a novel model, CLIP4CMR (CLIP For supervised Cross-Modal Retrieval), that employs pre-trained CLIP as the backbone network to perform supervised CMR. We then revisit existing loss function designs in CMR, including the most common pair-wise losses, class-wise losses, and hybrid ones, and provide insights on applying CLIP. Moreover, we investigate several issues of concern in supervised CMR and provide new perspectives for this field via CLIP4CMR, including robustness to modality imbalance and sensitivity to hyper-parameters. Extensive experimental results show that CLIP4CMR achieves SOTA results with significant improvements on the benchmark datasets Wikipedia, NUS-WIDE, Pascal-Sentence, and XmediaNet. Our data and code are publicly available at https://github.com/zhixiongz/CLIP4CMR.
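To make the three loss families named in the abstract concrete, here is a minimal sketch in plain Python. It is not the CLIP4CMR implementation: the function names, the contrastive form of the pair-wise term, the prototype-based cross-entropy used for the class-wise term, and the `margin` and `weight` parameters are all illustrative assumptions. The toy vectors stand in for embeddings that a real system would obtain from the CLIP image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_wise_loss(img_emb, txt_emb, labels, margin=0.5):
    """Contrastive-style pair-wise term (assumed form): pull same-class
    image-text pairs together, push different-class pairs at least
    `margin` apart in cosine distance."""
    total, count = 0.0, 0
    for i, u in enumerate(img_emb):
        for j, v in enumerate(txt_emb):
            dist = 1.0 - cosine(u, v)
            if labels[i] == labels[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            count += 1
    return total / count

def class_wise_loss(emb, labels, prototypes):
    """Class-wise term (assumed form): cross-entropy over cosine
    similarities to one prototype vector per class, with the same
    prototypes shared by both modalities."""
    total = 0.0
    for x, y in zip(emb, labels):
        logits = [cosine(x, p) for p in prototypes]
        log_norm = math.log(sum(math.exp(z) for z in logits))
        total += log_norm - logits[y]
    return total / len(emb)

def hybrid_loss(img_emb, txt_emb, labels, prototypes, weight=1.0):
    """Hybrid term: class-wise loss on both modalities plus a weighted
    pair-wise loss."""
    class_term = (class_wise_loss(img_emb, labels, prototypes)
                  + class_wise_loss(txt_emb, labels, prototypes))
    return class_term + weight * pair_wise_loss(img_emb, txt_emb, labels)

# Toy data: two classes, 2-D embeddings (hypothetical values).
labels = [0, 1]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
img = [[1.0, 0.0], [0.0, 1.0]]
txt_aligned = [[1.0, 0.0], [0.0, 1.0]]
txt_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(hybrid_loss(img, txt_aligned, labels, prototypes))
print(hybrid_loss(img, txt_swapped, labels, prototypes))
```

In this toy setup the aligned text embeddings yield a lower hybrid loss than the swapped ones, which is the behavior any of these loss families is meant to reward: samples of the same class end up close in the shared space regardless of modality.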

updated: Sat Jan 08 2022 06:00:22 GMT+0000 (UTC)

published: Sat Jan 08 2022 06:00:22 GMT+0000 (UTC)

arXiv
