Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval

Keyu Wen; Zhenshan Tan; Qingrong Cheng; Cheng Chen; Xiaodong Gu

Cross-modal pre-training has recently become a hot topic because of its wide application in downstream tasks such as retrieval, captioning, and question answering. However, existing methods adopt a one-stream pre-training model to learn a unified vision-language representation for cross-modal retrieval, which easily suffers from an explosion in computation. Moreover, although conventional double-stream structures are quite efficient, they still lack vital cross-modal interactions, resulting in low performance. Motivated by these challenges, we propose Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) to learn joint text-image representations. Structurally, COOKIE adopts the traditional double-stream structure because of its acceptable time consumption. To overcome the inherent defects of the double-stream structure mentioned above, we design two effective modules. Concretely, the first module is a weight-sharing transformer built on top of the visual and textual encoders, aiming to semantically align text and image. This design enables the visual and textual paths to focus on the same semantics. The second is a set of three specially designed contrastive learning objectives, aiming to share knowledge between different models. The shared cross-modal knowledge greatly improves unimodal representation learning, benefiting single-modal retrieval tasks. Extensive experimental results on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, demonstrate the superiority of our pre-training model in computational efficiency and statistical indicators.
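
The double-stream idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that each stream has already produced one feature vector per image or caption, uses a single shared linear map (`shared_head`, a hypothetical name) as a stand-in for the weight-sharing transformer, and shows only a generic symmetric image-text InfoNCE loss. The abstract does not specify the three contrastive objectives, so they are not reproduced here, and the temperature of 0.07 is an assumed value.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def shared_head(features, weight):
    """Apply the SAME linear map to features from either stream.

    Assumption: this stands in for the weight-sharing transformer; the
    point is only that both modalities pass through identical parameters.
    """
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_feats, text_feats, weight, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    img = shared_head(image_feats, weight)
    txt = shared_head(text_feats, weight)
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # toy shared weights
    images = [[1.0, 0.0], [0.0, 1.0]]     # toy image features
    captions = [[1.0, 0.0], [0.0, 1.0]]   # toy caption features, aligned by index
    print("aligned pairs: ", symmetric_contrastive_loss(images, captions, identity))
    print("shuffled pairs:", symmetric_contrastive_loss(images, captions[::-1], identity))
```

In this toy setting the loss is close to zero when the image and caption at each index agree, and large when the captions are shuffled. Because the two streams interact only through the final similarity matrix, image and text embeddings can be computed independently, which is the efficiency argument the abstract makes for double-stream models.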

updated: Fri Jul 08 2022 15:28:15 GMT+0000 (UTC)

published: Sat Jul 02 2022 04:08:44 GMT+0000 (UTC)

arXiv
