A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering

Alireza Salemi; Juan Altmayer Pizzorno; Hamed Zamani

Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image when the answer cannot be found in the image itself. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces of these two encoders. Extensive evaluation on two well-established KI-VQA datasets, OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9%, respectively. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model that generates textual answers for KI-VQA tasks. MM-FiD encodes the question, the image, and each retrieved passage separately and uses all passages jointly in its decoder. Compared to competitive baselines in the literature, this approach leads to 5.5% and 8.5% improvements in question answering accuracy on OK-VQA and FVQA, respectively.
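
The retrieval idea in the abstract, encoding queries and passages with the same pair of encoders into one shared space, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the hashed bag-of-words functions `text_encoder` and `multimodal_encoder` are hypothetical stand-ins for trained models, the image is represented as a tuple of tags, the two representations are combined by concatenation, and relevance is scored with a dot product. The abstract does not state how DEDR combines or scores its representations, and the knowledge distillation step is not modeled.

```python
import hashlib
import math

DIM = 8  # toy embedding size, chosen only for illustration


def _toy_embed(tokens, salt):
    """Hypothetical stand-in for a trained encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for tok in tokens:
        digest = hashlib.md5((salt + tok).encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def text_encoder(text):
    """Uni-modal (textual) encoder stand-in: sees only the text."""
    return _toy_embed(text.lower().split(), "text:")


def multimodal_encoder(text, image_tags=()):
    """Multi-modal encoder stand-in: sees the text plus image tags (assumed image proxy)."""
    return _toy_embed(text.lower().split() + list(image_tags), "mm:")


def encode(text, image_tags=()):
    """Symmetric encoding: queries and passages go through the same two encoders.

    Concatenation is an assumption; passages simply carry no image tags.
    """
    return text_encoder(text) + multimodal_encoder(text, image_tags)


def score(query_vec, passage_vec):
    """Dot-product relevance in the shared embedding space."""
    return sum(q * p for q, p in zip(query_vec, passage_vec))


def retrieve(question, image_tags, passages, k=2):
    """Return the k passages with the highest score for the (question, image) query."""
    query_vec = encode(question, image_tags)
    ranked = sorted(passages, key=lambda p: score(query_vec, encode(p)), reverse=True)
    return ranked[:k]
```

A call such as `retrieve("what fruit is this", ("banana", "yellow"), passages, k=5)` would return the five highest-scoring passages, which a reader model such as MM-FiD would then consume; with these toy encoders the ranking carries no real semantic meaning.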

updated: Wed Apr 26 2023 16:14:39 GMT+0000 (UTC)

published: Wed Apr 26 2023 16:14:39 GMT+0000 (UTC)

arXiv
