WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

Sha Yuan; Zhao Shuai; Leng Jiahong; Xue Zhao; Zhao Hanyu; Tang Jie

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are typically fed a combination of multiple public datasets to meet the demand for large-scale training data. However, because these datasets differ in size, task type and quality, training a model on a mixture of them can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650 million image-text pairs in total. Specifically, about 600 million pairs are collected from multiple webpages in which image and caption are weakly correlated, and the other 50 million strongly correlated image-text pairs are collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the effectiveness of the dataset. The results show that WuDaoMM can serve as an efficient dataset for VLPMs, especially for models in the text-to-image generation task. The data is released at https://data.wudaoai.cn

updated: Tue Mar 22 2022 06:12:20 GMT+0000 (UTC)

published: Tue Mar 22 2022 06:12:20 GMT+0000 (UTC)

arXiv
