Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Yang Liu; Guanbin Li; Liang Lin

Existing visual question answering methods tend to capture cross-modal spurious correlations and fail to discover the true causal mechanisms that support faithful reasoning based on the dominant visual evidence and the question intention. Moreover, existing methods usually ignore cross-modal event-level understanding, which requires jointly modeling event temporality, causality, and dynamics. In this work, we approach event-level visual question answering from a new perspective, i.e., cross-modal causal relational reasoning, by introducing causal intervention methods to discover the true causal structures of the visual and linguistic modalities. Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR) to achieve robust causality-aware visual-linguistic question answering. To discover cross-modal causal structures, we propose the Causality-aware Visual-Linguistic Reasoning (CVLR) module, which collaboratively disentangles the visual and linguistic spurious correlations via front-door and back-door causal interventions. To model the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a Spatial-Temporal Transformer (STT) that creates multi-modal co-occurrence interactions between visual and linguistic content. To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that uses hierarchical linguistic semantic relations as guidance to learn global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
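
For readers unfamiliar with the two interventions named above, the standard textbook adjustment formulas are sketched below. This is general causal-inference background, not the paper's specific formulation: the symbols X (input features), Y (answer), Z (an observed confounder satisfying the back-door criterion), and M (a mediator satisfying the front-door criterion) are assumptions introduced here for illustration, and the abstract does not state how CVLR instantiates them.

```latex
% Back-door adjustment: block the confounding path X <- Z -> Y
% by stratifying on the observed confounder Z.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

% Front-door adjustment: the confounder is unobserved, but the effect of
% X on Y flows entirely through an observed mediator M.
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

In both cases the interventional distribution P(Y | do(X)) replaces the observational P(Y | X), which is what removes the spurious correlation introduced by the confounder.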

updated: Wed Apr 19 2023 03:47:14 GMT+0000 (UTC)

published: Tue Jul 26 2022 04:25:54 GMT+0000 (UTC)

arXiv
