WebQA: Multihop and Multimodal QA

Yingshan Chang; Mridu Narang; Hisami Suzuki; Guihong Cao; Jianfeng Gao; Yonatan Bisk

Web search is fundamentally multimodal and multihop. Often, even before asking a question, we go directly to image search to find our answers. Further, we rarely find an answer in a single source; instead, we aggregate information and reason through implications. Despite how common this everyday experience is, there is at present no unified question answering benchmark that requires a single model to answer long-form natural language questions from text and open-ended visual sources, akin to a human's experience. We propose to bridge this gap between the natural language and computer vision communities with WebQA. We show that (a) our multihop text queries are difficult for a large-scale transformer model, and (b) existing multimodal transformers and visual representations do not perform well on open-domain visual queries. Our challenge for the community is to create a unified multimodal reasoning model that seamlessly transitions and reasons regardless of the source modality.

updated: Wed Sep 01 2021 19:43:59 GMT+0000 (UTC)

published: Wed Sep 01 2021 19:43:59 GMT+0000 (UTC)

arXiv
