VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Yan Gong; Georgina Cosma

Relation-focused cross-modal information retrieval retrieves information based on the relations expressed in user queries, and it is particularly important in information retrieval applications and next-generation search engines. While pre-trained networks such as Contrastive Language-Image Pre-training (CLIP) have achieved state-of-the-art performance in cross-modal learning tasks, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a Local encoder. VITR comprises two main components: (1) extending the capabilities of ViT-based cross-modal networks to extract and reason about region relations in images; and (2) aggregating the reasoned results with global knowledge to predict the similarity scores between images and descriptions. Experiments applied the proposed network to relation-focused cross-modal information retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results show that VITR outperformed various other state-of-the-art networks, including CLIP, VSE∞, and VSRN++, on both image-to-text and text-to-image cross-modal information retrieval tasks.
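
The second component described above, combining a region-level signal with a global image-text score to rank candidates, can be illustrated with a toy sketch. This is not the VITR architecture: the function names, the `alpha` weight, the max-over-regions rule, and the hand-written vectors are all assumptions made for illustration, standing in for learned encoders and the paper's reasoning and aggregation modules.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_similarity(global_img, region_feats, text_emb, alpha=0.5):
    """Blend a global image-text score with a region-level score.

    Assumption: alpha and the max-over-regions rule are illustrative
    placeholders, not the aggregation used by VITR.
    """
    global_score = cosine(global_img, text_emb)
    local_score = max(cosine(r, text_emb) for r in region_feats)
    return alpha * global_score + (1 - alpha) * local_score


def rank_images(images, text_emb, alpha=0.5):
    """Return image ids sorted by descending fused similarity (text-to-image)."""
    scored = [
        (fused_similarity(g, regions, text_emb, alpha), image_id)
        for image_id, (g, regions) in images.items()
    ]
    return [image_id for _, image_id in sorted(scored, reverse=True)]


# Hypothetical 2-D embeddings: (global vector, list of region vectors).
images = {
    "img_a": ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "img_b": ([0.0, 1.0], [[0.0, 1.0]]),
}
query = [1.0, 0.0]
print(rank_images(images, query))
```

In this toy setting the query aligns with both the global vector and one region of `img_a`, so it ranks first. Image-to-text retrieval would apply the same scoring with the roles of query and candidates swapped.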

updated: Mon Apr 24 2023 15:36:38 GMT+0000 (UTC)

published: Mon Feb 13 2023 13:34:29 GMT+0000 (UTC)

arXiv
