VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Yan Gong; Georgina Cosma; Axel Finke

The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across different modalities. Pre-trained networks such as Contrastive Language-Image Pre-training (CLIP) have attracted significant attention for their strong performance on a variety of cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a local encoder. VITR has two key components. First, it extends ViT-based cross-modal networks so that they can extract and reason about the region relations present in images. Second, it incorporates a fusion module that combines the reasoned results with global knowledge to predict similarity scores between images and descriptions. VITR was evaluated on relation-focused cross-modal information retrieval tasks. Results on the RefCOCOg, CLEVR, and Flickr30K datasets show that VITR consistently outperforms state-of-the-art networks in both image-to-text and text-to-image retrieval.
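As a rough illustration of the idea of fusing a global image-text score with a region-level score, the sketch below ranks images for a text query. It is not the VITR architecture: the abstract does not specify how relations are reasoned or how fusion is performed, so the max-over-regions local score, the weighted-sum fusion, the `alpha` weight, and all function names here are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_score(region_vecs, text_vec):
    """Hypothetical stand-in for region-level reasoning:
    the best cosine match between any image region and the text."""
    if not region_vecs:
        return 0.0
    return max(cosine(r, text_vec) for r in region_vecs)


def fused_similarity(global_vec, region_vecs, text_vec, alpha=0.5):
    """Assumed fusion: a weighted sum of the global (CLIP-style) score
    and the local region score. `alpha` is an illustrative parameter."""
    g = cosine(global_vec, text_vec)
    l = local_score(region_vecs, text_vec)
    return alpha * g + (1.0 - alpha) * l


def rank_images(text_vec, images, alpha=0.5):
    """Text-to-image retrieval: return image ids sorted by descending fused score.
    `images` maps an id to a (global_vec, region_vecs) pair."""
    scored = [
        (fused_similarity(g, regions, text_vec, alpha), image_id)
        for image_id, (g, regions) in images.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]
```

In a real system the global and region vectors would come from trained encoders; here they are plain lists of floats so that the ranking logic can be read in isolation.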

updated: Thu Jul 27 2023 21:48:30 GMT+0000 (UTC)

published: Mon Feb 13 2023 13:34:29 GMT+0000 (UTC)

arXiv
