Visually Grounded VQA by Lattice-based Retrieval

Daniel Reich; Felix Putze; Tanja Schultz

Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system ties a question and its answer to the relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest improved scene understanding. While VQA accuracy has seen impressive gains over the past few years, explicit improvements to VG performance, and its evaluation, have often taken a back seat on the road to overall accuracy improvements. One cause is the predominant learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, also called a "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among the examined systems and exhibits exceptional generalization capabilities in a number of scenarios.
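To make the idea of retrieval over a lattice more concrete, the following is a minimal sketch of a best-path search on a small weighted, directed, acyclic graph. It is not the authors' implementation: the abstract does not specify how the lattice is built or scored, so the node names, the edge weights (treated here as log-scores to be summed), the `best_path` function, and the toy question "What color is the cup on the table?" are all assumptions made for illustration.

```python
from collections import defaultdict


def best_path(edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path in a DAG.

    `edges` is a list of (from_node, to_node, weight) tuples. Weights are
    summed along a path, so they can be read as log-scores (an assumption).
    Returns None if the sink cannot be reached from the source.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: visit nodes only after all of their predecessors.
    order = []
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a lattice")

    # Dynamic programming over the topological order.
    best = {source: (0.0, None)}  # node -> (best score so far, predecessor)
    for u in order:
        if u not in best:
            continue  # not reachable from the source
        for v, w in graph[u]:
            candidate = best[u][0] + w
            if v not in best or candidate > best[v][0]:
                best[v] = (candidate, u)

    if sink not in best:
        return None

    # Walk the predecessor links back from the sink to recover the path.
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[sink][0], path[::-1]


# Hypothetical toy lattice for "What color is the cup on the table?".
# Region nodes (table_1, cup_1, cup_2) stand in for scene-graph objects;
# "red" and "blue" stand in for candidate answers.
EDGES = [
    ("start", "table_1", -0.1),
    ("table_1", "cup_1", -0.2),   # cup_1 is strongly "on" table_1
    ("table_1", "cup_2", -1.5),   # cup_2 matches the relation poorly
    ("cup_1", "red", -0.3),
    ("cup_1", "blue", -1.2),
    ("cup_2", "blue", -0.1),
    ("red", "end", 0.0),
    ("blue", "end", 0.0),
]

score, path = best_path(EDGES, "start", "end")
answer = path[-2]        # last node before the sink
grounding = path[1:-2]   # region nodes visited on the way to the answer
print(answer, grounding)
```

In this sketch the answer and its grounding come from the same path: the regions visited on the way to the answer node are the regions the answer is tied to, which is one way to read the claim that VG is part of the core search procedure.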

updated: Tue Nov 15 2022 12:12:08 GMT+0000 (UTC)

published: Tue Nov 15 2022 12:12:08 GMT+0000 (UTC)

arXiv
