Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Shir Gur; Natalia Neverova; Chris Stauffer; Ser-Nam Lim; Douwe Kiela; Austin Reiter

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves a substantial improvement on image-caption retrieval compared with similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.
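The retrieval step and the index hot-swap mentioned in the abstract can be illustrated with a minimal sketch. It assumes images and captions are already embedded in a shared space and uses cosine similarity for lookup; the abstract does not specify the similarity measure, the index structure, or any class names, so `CaptionIndex`, `Retriever`, `swap_index`, and the toy two-dimensional vectors are all hypothetical stand-ins rather than the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class CaptionIndex:
    """Hypothetical in-memory index of caption embeddings (not from the paper)."""

    def __init__(self, embeddings, captions):
        self.embeddings = [normalize(e) for e in embeddings]
        self.captions = list(captions)

    def search(self, query, k=2):
        """Return the k captions whose embeddings are closest to the query."""
        q = normalize(query)
        scores = [sum(a * b for a, b in zip(e, q)) for e in self.embeddings]
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return [(self.captions[i], scores[i]) for i in order[:k]]


class Retriever:
    """Holds a reference to an index that can be replaced at inference time."""

    def __init__(self, index):
        self.index = index

    def swap_index(self, new_index):
        # Hot-swap: only the knowledge source changes, nothing is retrained.
        self.index = new_index

    def retrieve(self, image_embedding, k=2):
        return [caption for caption, _ in self.index.search(image_embedding, k)]


# Toy embeddings standing in for the output of a trained alignment model.
index_a = CaptionIndex(
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    ["a dog on grass", "a red car", "a dog in a car"],
)
index_b = CaptionIndex(
    [[0.9, 0.1], [0.1, 0.9]],
    ["a puppy playing", "a parked truck"],
)

retriever = Retriever(index_a)
image_embedding = [0.9, 0.2]  # hypothetical embedding of a query image

first = retriever.retrieve(image_embedding, k=2)
retriever.swap_index(index_b)
second = retriever.retrieve(image_embedding, k=1)
print(first, second)
```

In a retrieval-augmented setup of this kind, the retrieved captions would be passed to the downstream model alongside the image and question. Because the retriever only holds a reference to the index, replacing the index changes what is retrieved without touching any model weights.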

updated: Fri Apr 16 2021 13:27:45 GMT+0000 (UTC)

published: Fri Apr 16 2021 13:27:45 GMT+0000 (UTC)

arXiv
