Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

Zhiwei Zhang; Yuliang Liu

The recent success of ChatGPT and GPT-4 has drawn widespread attention to multimodal dialogue systems. However, the academic community lacks a dataset that can validate the multimodal generation capabilities of Visual Language Models (VLMs) in textual-visual chat tasks. In this paper, we construct two new multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually photographed Fruit-ATVC dataset (50K), both featuring visual and text-based inputs and outputs. Additionally, to enable the multimodal system to reject human requests (i.e., demonstrate accountability), as in language-based ChatGPT conversations, we develop specific rules and incorporate them into the datasets as supervisory signals. This allows the trained VLM to provide a yes or no answer after visual and textual reasoning, accompanied by a language explanation of why the human instruction cannot be executed. In our method, we propose a two-stage training procedure to train the image auto-encoder and auto-regressive transformer from scratch. The first stage uses a discrete variational autoencoder (dVAE) to compress each image into a short sequence of tokens. In the second stage, these image tokens are concatenated with text tokens into a single data stream and fed into the decoder-based transformer, which generates the visual re-creation and the textual feedback. We provide comprehensive analyses of the experimental results in terms of re-created image quality, answer accuracy, and model behavior when faced with uncertainty and imperfect user queries. We hope our explorations and findings contribute valuable insights regarding the accountability of textual-visual generative models.
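The abstract describes concatenating image tokens and text tokens into a single data stream for a decoder-based transformer. A minimal sketch of that idea follows. It is not the authors' implementation: the vocabulary sizes, the image-first ordering, and the helper names `build_stream` and `split_stream` are assumptions made purely for illustration. Image token ids are shifted by the text vocabulary size so that both modalities can share one embedding table without id collisions.

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only for this illustration.
TEXT_VOCAB_SIZE = 1000    # number of distinct text token ids
IMAGE_VOCAB_SIZE = 8192   # number of discrete image codes (assumed dVAE codebook size)


def build_stream(image_tokens: List[int], text_tokens: List[int]) -> List[int]:
    """Merge image and text token ids into one sequence.

    Image ids are offset by TEXT_VOCAB_SIZE so the two id ranges do not
    overlap. The image-first ordering is an assumption of this sketch.
    """
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_tokens):
        raise ValueError("image token id out of range")
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token id out of range")
    shifted_image = [TEXT_VOCAB_SIZE + t for t in image_tokens]
    return shifted_image + list(text_tokens)


def split_stream(stream: List[int], n_image: int) -> Tuple[List[int], List[int]]:
    """Invert build_stream, given how many image tokens lead the stream."""
    image_tokens = [t - TEXT_VOCAB_SIZE for t in stream[:n_image]]
    text_tokens = list(stream[n_image:])
    return image_tokens, text_tokens


# Example: two image codes followed by a three-token instruction.
stream = build_stream([0, 5], [7, 8, 9])  # [1000, 1005, 7, 8, 9]
```

In a setup like this, an autoregressive model would be trained to predict each token of the stream from the tokens before it, so the same network can emit both image codes and text.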

updated: Wed Jun 14 2023 16:03:55 GMT+0000 (UTC)

published: Fri Mar 10 2023 15:35:11 GMT+0000 (UTC)

arXiv
