MaXM: Towards Multilingual Visual Question Answering

Soravit Changpinyo; Linting Xue; Idan Szpektor; Ashish V. Thapliyal; Julien Amelot; Michal Yarom; Xi Chen; Radu Soricut

Visual Question Answering (VQA) has been studied primarily through the lens of the English language. Yet tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, extensible, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages.

updated: Sat Feb 18 2023 11:11:43 GMT+0000 (UTC)

published: Mon Sep 12 2022 16:53:37 GMT+0000 (UTC)

arXiv
