Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Hao Li; Jinfa Huang; Peng Jin; Guoli Song; Qi Wu; Jie Chen

Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts. In most cases, the texts naturally attach to the surfaces of objects. Therefore, spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are constrained to 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, thereby impairing the interpretability and performance of TextVQA models. In this paper, we introduce 3D geometric information into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step. Specifically, to enhance the model's understanding of 3D spatial relationships, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects; (ii) we design a depth-aware attention calibration module for calibrating the OCR tokens' attention according to critical objects. Extensive experiments show that our method achieves state-of-the-art performance on the TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in the TextVQA and ST-VQA validation splits. We also verify the generalizability of our model on the text-based image captioning task.

updated: Thu Jun 15 2023 02:38:25 GMT+0000 (UTC)

published: Wed Sep 21 2022 12:49:14 GMT+0000 (UTC)

arXiv
