Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Nicola Messina; Giuseppe Amato; Andrea Esuli; Fabrizio Falchi; Claudio Gennaro; Stéphane Marchand-Maillet


Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links between the two pipelines would make it impossible to extract visual and textual features separately, which is what the online search and offline indexing steps of large-scale retrieval systems require. For this reason, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way for research on effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% in Recall@1 on the image and sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
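To make the idea of a late, fine-grained alignment concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes that region and word features have already been produced by two independent encoders, that the word-region alignment is a cosine-similarity matrix, and that the global image-sentence score is obtained by taking the best-matching region for each word and summing over words. The abstract does not specify the pooling rule, so that choice is an assumption, and all function names are hypothetical. A simple Recall@K helper is included to show how such scores would be evaluated.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_matrix(regions, words):
    """Cosine similarity between every image region (rows) and every word (columns)."""
    regions = [l2_normalize(r) for r in regions]
    words = [l2_normalize(w) for w in words]
    return [[sum(a * b for a, b in zip(r, w)) for w in words] for r in regions]


def global_similarity(regions, words):
    """Pool the alignment matrix into one image-sentence score.

    Assumed pooling: best-matching region per word, summed over words.
    """
    matrix = alignment_matrix(regions, words)
    return sum(max(row[j] for row in matrix) for j in range(len(words)))


def recall_at_k(scores, k=1):
    """Fraction of queries whose ground-truth item is ranked in the top k.

    scores[i][j] is the score of query i against item j; item i is assumed
    to be the ground truth for query i.
    """
    hits = 0
    for i, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy features (hypothetical): two images with two regions each,
# two sentences with two words each, all in a shared 2-D space.
images = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.1]],
]
sentences = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [1.0, 0.1]],
]

# Image retrieval: each sentence is a query scored against every image.
scores = [[global_similarity(img, sent) for img in images] for sent in sentences]
print("Recall@1:", recall_at_k(scores, k=1))
```

Because the two modalities interact only inside `global_similarity`, the region features of a whole image collection could be computed and stored offline, with only the query sentence encoded at search time. This is the separation property the abstract emphasizes.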

updated: Tue Mar 02 2021 16:12:52 GMT+0000 (UTC)

published: Wed Aug 12 2020 11:02:40 GMT+0000 (UTC)

arXiv

