The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

Gi-Cheon Kang; Sungdong Kim; Jin-Hwa Kim; Donghyun Kwak; Byoung-Tak Zhang

Visual dialog (VisDial) is the task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has either trained dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), that leverages unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and then generates synthetic dialogs about those images via multimodal conditional text generation. GST then trains a dialog agent on both the synthetic and the original VisDial data. As a result, GST scales the amount of training data by up to an order of magnitude relative to VisDial (from 1.2M to 12.9M QA pairs). For robust training on the synthetic dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on the VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both. We further observe that GST is robust against both visual and textual adversarial attacks. Finally, GST yields strong performance gains in the low-data regime. Code is available at https://github.com/gicheonkang/gst-visdial.
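
The perplexity-based data selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection rule, so the code below assumes the simplest reading: each synthetic answer comes with per-token log-probabilities from the generator, and an answer is kept only if its perplexity is at or below a cutoff. The function names `perplexity` and `select_by_perplexity` and the threshold `max_ppl` are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_by_perplexity(
    candidates: Sequence[Tuple[str, Sequence[float]]],
    max_ppl: float,
) -> List[str]:
    """Keep the texts whose perplexity is at or below max_ppl.

    Assumption: each candidate is (text, natural-log token probabilities)
    produced by the generator. max_ppl is an illustrative cutoff only.
    """
    kept = []
    for text, log_probs in candidates:
        if perplexity(log_probs) <= max_ppl:
            kept.append(text)
    return kept


if __name__ == "__main__":
    synthetic = [
        ("yes, it is sunny", [math.log(0.5)] * 4),    # confident generation
        ("purple seven maybe", [math.log(0.01)] * 3),  # unlikely generation
    ]
    print(select_by_perplexity(synthetic, max_ppl=10.0))
```

Lower perplexity means the generator assigned higher probability to its own output, so a cutoff of this kind discards the synthetic answers the model was least confident about. How the cutoff is chosen, and whether selection operates per answer or per dialog, is left open here.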

updated: Thu Mar 02 2023 12:33:10 GMT+0000 (UTC)

published: Wed May 25 2022 05:40:00 GMT+0000 (UTC)

arXiv

