A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Zhixiong Zeng; Wenji Mao

Cross-Modal Retrieval (CMR) is an important research topic spanning multimodal computing and information retrieval: it takes one type of data as the query and retrieves relevant data of another type. It is widely used in real-world applications. Recently, vision-language pre-trained models, represented by CLIP, have demonstrated their superiority in learning visual and textual representations and have achieved impressive performance on a variety of vision- and language-related tasks. Although CLIP and earlier pre-trained models have brought large performance improvements to unsupervised CMR, their performance and impact on supervised CMR have rarely been explored, because they lack a common representation for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model and conduct a comprehensive empirical study. We evaluate its performance and impact on supervised CMR and attempt to answer several key research questions. To this end, we first propose a novel model, CLIP4CMR (CLIP enhanced network for Cross-Modal Retrieval), which employs the pre-trained CLIP as the backbone network to perform supervised CMR. Then, using the CLIP4CMR framework, we revisit the design of the different learning objectives in current CMR methods to provide new insights into model design. Moreover, we investigate the aspects of greatest concern when applying CMR in practice, including robustness to modality imbalance and sensitivity to hyper-parameters, to provide new perspectives for practical applications. Through extensive experiments, we show that CLIP4CMR achieves SOTA results with prominent improvements on the benchmark datasets, and that it can serve as a fundamental framework for empirically studying the key research issues of supervised CMR, with significant implications for model design and practical considerations.
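To make the retrieval setting in the abstract concrete, the following is a minimal sketch of class-level cross-modal retrieval, assuming image and text embeddings already live in one common space. It is not the CLIP4CMR implementation: the embeddings are hand-made toy vectors rather than CLIP outputs, the helper names (`cosine`, `retrieve`, `average_precision`) are hypothetical, and the use of cosine similarity and average precision is an assumption, since the abstract does not name a similarity measure or an evaluation metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query, gallery):
    """Return gallery indices sorted by decreasing similarity to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)


def average_precision(ranked_labels, query_label):
    """Average precision where an item is relevant if it shares the query's class."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data (assumption): one image query and three text items in a shared 2-D space.
image_query = [0.9, 0.1]
image_label = "dog"
text_gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]]
text_labels = ["dog", "cat", "dog"]

# Image-to-text retrieval: rank the texts, then score the ranking by class agreement.
ranking = retrieve(image_query, text_gallery)
ap = average_precision([text_labels[i] for i in ranking], image_label)
print(ranking, ap)
```

In the supervised setting, relevance is decided by shared class labels rather than by a one-to-one image-caption pairing, which is why the scoring function compares labels instead of item identities.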

updated: Sun Apr 17 2022 15:32:25 GMT+0000 (UTC)

published: Sat Jan 08 2022 06:00:22 GMT+0000 (UTC)

arXiv

