MixGen: A New Multi-Modal Data Augmentation

Xiaoshuai Hao; Yi Zhu; Srikar Appalaraju; Aston Zhang; Wanqian Zhang; Bo Li; Mu Li

Data augmentation is essential for improving data efficiency in deep learning. In prior work on vision-language pre-training, data is augmented for either images or text, but not for both jointly. In this paper, we present MixGen, a joint data augmentation for vision-language representation learning that further improves data efficiency. It generates new image-text pairs that preserve semantic relationships by interpolating images and concatenating text. It is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures (CLIP, ViLT, ALBEF, and TCL) across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen to ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
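The core operation described in the abstract (interpolate two images, concatenate their texts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_pair`, the mixing weight `lam` and its default of 0.5, the space used to join captions, and the representation of an image as a flat list of floats are all assumptions made to keep the example dependency-free.

```python
from typing import List, Tuple

Image = List[float]  # hypothetical stand-in: a flattened image; real pipelines use tensors


def mix_pair(
    img_a: Image,
    txt_a: str,
    img_b: Image,
    txt_b: str,
    lam: float = 0.5,  # assumed mixing weight, not taken from the abstract
) -> Tuple[Image, str]:
    """Build one new image-text pair from two existing pairs."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Linear interpolation of pixel values, element by element.
    mixed_img = [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]

    # Concatenate the two captions so the text describes both source images.
    mixed_txt = txt_a + " " + txt_b
    return mixed_img, mixed_txt


if __name__ == "__main__":
    image, caption = mix_pair([0.0, 1.0], "a dog", [1.0, 0.0], "a cat")
    print(image, caption)
```

Because the new caption keeps both original sentences and the new image keeps content from both originals, the generated pair plausibly remains a matching image-text pair, which is the property the abstract refers to as preserved semantic relationships.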

updated: Mon Jan 09 2023 22:26:06 GMT+0000 (UTC)

published: Thu Jun 16 2022 17:58:09 GMT+0000 (UTC)

arXiv

