Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Qian Yang; Yunxin Li; Baotian Hu; Lin Ma; Yuxing Ding; Min Zhang


Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and to generate a sentence that explains the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and on a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image and ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator based only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to the vision-language reasoning. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG utilizes lexical constraints to explicitly incorporate the words or chunks that the relation inferrer focuses on into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets, and the experimental results indicate that CALeC significantly outperforms competing models on inference accuracy and on the quality of the generated explanations.
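The lexical-constraint idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `select_constraints` and `satisfies_constraints`, the top-k selection rule, and the attention weights are all assumptions made up for illustration. The sketch picks the words with the highest (hypothetical) attention weights as constraints and checks whether a candidate explanation mentions all of them.

```python
from typing import List, Sequence


def select_constraints(tokens: Sequence[str], weights: Sequence[float], k: int = 2) -> List[str]:
    """Return the k tokens with the highest weights, in their original order.

    Hypothetical helper: stands in for "words the relation inferrer focuses on".
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    # Rank indices by weight (descending); ties keep the earlier token first.
    ranked = sorted(range(len(tokens)), key=lambda i: (-weights[i], i))
    keep = sorted(ranked[:k])  # restore sentence order
    return [tokens[i] for i in keep]


def satisfies_constraints(explanation: str, constraints: Sequence[str]) -> bool:
    """Check that every constraint word occurs in the explanation.

    Matching is case-insensitive on whitespace-separated words, so
    punctuation attached to a word is not stripped in this sketch.
    """
    words = set(explanation.lower().split())
    return all(c.lower() in words for c in constraints)


# Made-up hypothesis tokens and attention weights, for illustration only.
tokens = ["a", "man", "is", "riding", "a", "horse"]
weights = [0.02, 0.20, 0.03, 0.35, 0.02, 0.38]

constraints = select_constraints(tokens, weights, k=2)
faithful = satisfies_constraints("the man is riding a horse in the image", constraints)
unfaithful = satisfies_constraints("the man is standing outside", constraints)
```

In an actual generator the constraints would steer decoding rather than filter finished sentences; the post-hoc check here only shows what it means for an explanation to contain the focused words.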

updated: Sat Jul 23 2022 03:19:50 GMT+0000 (UTC)

published: Sat Jul 23 2022 03:19:50 GMT+0000 (UTC)

arXiv

