VisQA: X-raying Vision and Language Reasoning in Transformers

Theo Jaunet; Corentin Kervadec; Romain Vuillemot; Grigory Antipov; Moez Baccouche; Christian Wolf

Visual Question Answering (VQA) systems answer open-ended textual questions about input images. They are a testbed for learning high-level reasoning, with a primary use in HCI, for instance assistance for the visually impaired. Recent research has shown that state-of-the-art models tend to produce answers by exploiting biases and shortcuts in the training data, sometimes without even looking at the input image, instead of performing the required reasoning steps. We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation. It exposes the key element of state-of-the-art neural models: attention maps in transformers. Our working hypothesis is that the reasoning steps leading to model predictions are observable from attention distributions, which are particularly useful for visualization. The design process of VisQA was motivated by well-known bias examples from the fields of deep learning and vision-language reasoning, and was evaluated in two ways. First, as the result of a collaboration between three fields (machine learning, vision and language reasoning, and data analytics), the work led to a better understanding of bias exploitation in neural models for VQA, which eventually influenced model design and training through the proposal of a method for transferring reasoning patterns from an oracle model. Second, we report on the design of VisQA and on a goal-oriented evaluation in which multiple experts used it to analyze a model's decision process, providing evidence that it makes the inner workings of models accessible to users.
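
To make the notion of an attention distribution concrete, the following is a minimal sketch of standard scaled dot-product attention weights, not code from VisQA or from the models it inspects. The toy embeddings, the function names, and the use of entropy as a measure of how focused a row is are all assumptions made for illustration. Each row of the resulting map is a probability distribution showing how strongly one question token attends to each image region.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights: one distribution per query."""
    dim = len(keys[0])
    rows = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        rows.append(softmax(scores))
    return rows

def entropy(dist):
    """Shannon entropy in nats; low values mean sharply focused attention."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical toy embeddings: 2 question tokens attending over 3 image regions.
question_tokens = [[1.0, 0.0], [0.0, 0.0]]
image_regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attn = attention_map(question_tokens, image_regions)
for token_index, row in enumerate(attn):
    print(token_index, [round(p, 3) for p in row], round(entropy(row), 3))
```

In this toy setup the first token concentrates most of its weight on the first region, while the second token spreads its weight uniformly. Contrasts of this kind, between focused and diffuse rows, are what a visualization of attention maps lets a user inspect.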

updated: Tue Jul 20 2021 09:57:29 GMT+0000 (UTC)

published: Fri Apr 02 2021 08:08:25 GMT+0000 (UTC)

arXiv
