NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

Fawaz Sammani; Tanmoy Mukherjee; Nikos Deligiannis

Natural language explanation (NLE) models aim to explain the decision-making process of a black-box system by generating natural language sentences that are human-friendly, high-level and fine-grained. Current NLE models explain the decision-making process of a vision or vision-language model (the task model), e.g., a VQA model, via a language model (the explanation model), e.g., GPT. Besides the additional memory and inference time that the task model requires, the task and explanation models are completely independent, which dissociates the explanation from the reasoning process used to predict the answer. We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it. We first pre-train on large-scale image-caption pairs for general understanding of images, and then formulate the answer as a text prediction task along with the explanation. Without region proposals or a task model, the resulting framework attains better evaluation scores, contains far fewer parameters and is 15× faster than the current state-of-the-art (SoA) model. We then address the problem of evaluating explanations, which are often generic and data-biased and can come in several forms. We therefore design two new evaluation measures: (1) explain-predict and (2) retrieval-based attack, a self-evaluation framework that requires no labels. Code is available at https://github.com/fawazsammani/nlxgpt.
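The idea of formulating the answer as a text prediction task along with the explanation can be illustrated with a minimal sketch. The template "the answer is ... because ..." and the helper names `build_target` and `split_prediction` are assumptions made for illustration only; the abstract does not specify the exact sequence format, and this is not the code from the linked repository.

```python
# Hypothetical template: the abstract does not state the exact wording.
PREFIX = "the answer is "
SEPARATOR = " because "


def build_target(answer: str, explanation: str) -> str:
    """Join an answer and its explanation into one text sequence.

    A single language model can then be trained to generate this
    sequence token by token, so answer and explanation share one decoder.
    """
    return f"{PREFIX}{answer.strip()}{SEPARATOR}{explanation.strip()}"


def split_prediction(text: str) -> tuple[str, str]:
    """Recover (answer, explanation) from a generated sequence.

    If the separator is missing, the explanation is returned empty.
    """
    body = text.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    answer, _, explanation = body.partition(SEPARATOR)
    return answer.strip(), explanation.strip()


target = build_target("yes", "the man is holding a surfboard")
answer, explanation = split_prediction(target)
```

Because the answer and the explanation come from the same generated sequence, no separate task model is needed at inference time, which is the property the abstract emphasizes.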

updated: Wed Mar 09 2022 22:57:15 GMT+0000 (UTC)

published: Wed Mar 09 2022 22:57:15 GMT+0000 (UTC)

arXiv
