Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

Jiaxi Gu; Xiaojun Meng; Guansong Lu; Lu Hou; Minzhe Niu; Hang Xu; Xiaodan Liang; Wei Zhang; Xin Jiang; Chunjing Xu

This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods, with the aim of facilitating Vision-Language Pre-training (VLP) research and community development. Recent dual-stream VLP models such as CLIP, ALIGN, and FILIP have shown remarkable performance on various downstream tasks, as well as strong zero-shot ability on open-domain tasks. However, their success relies heavily on the scale of the pre-training datasets. Although English vision-language datasets exist at both small scale (e.g., Flickr30k, CC12M) and large scale (LAION-400M), the community currently lacks large-scale vision-language benchmarks in Chinese, which hinders the development of broader multilingual applications. Moreover, publicly available large-scale Chinese cross-modal pre-training datasets are very rare, making it hard to use pre-trained models as services for downstream tasks. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs collected from the web. Furthermore, we release a group of large models pre-trained with advanced image encoders (ResNet/ViT/SwinT) and different pre-training methods (CLIP/FILIP/LiT). We provide extensive experiments, in-depth benchmarking on different downstream tasks, and some exciting findings. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods, yielding superior performance on various downstream tasks such as zero-shot image classification and image-text retrieval. More information is available at https://wukong-dataset.github.io/wukong-dataset/.
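
To make the zero-shot image classification task mentioned in the abstract concrete, here is a minimal sketch of how a dual-stream model is typically used for it: the image and one text prompt per class are embedded separately, and the class whose prompt embedding has the highest cosine similarity to the image wins. This is an illustration only, not the Wukong authors' implementation. The function names, the toy 3-dimensional embeddings, the class labels, and the temperature of 0.07 are all assumptions made for this example; real embeddings would come from the pre-trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, class_text_embeddings, temperature=0.07):
    """Return (best_label, probabilities) for a single image.

    class_text_embeddings maps each label to the embedding of a text prompt
    describing that class. Both encoders are assumed to exist already; this
    hypothetical helper only performs the similarity matching step.
    """
    img = l2_normalize(image_embedding)
    labels = list(class_text_embeddings)

    # Cosine similarity between the image and each class prompt,
    # divided by a temperature (0.07 is an assumed value).
    logits = []
    for label in labels:
        txt = l2_normalize(class_text_embeddings[label])
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)

    # Numerically stable softmax over the classes.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}

    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings standing in for real encoder outputs (assumed values).
image_vec = [0.9, 0.1, 0.0]
text_vecs = {
    "猫": [1.0, 0.0, 0.0],    # "cat"
    "狗": [0.0, 1.0, 0.0],    # "dog"
    "汽车": [0.0, 0.0, 1.0],  # "car"
}

label, probs = zero_shot_classify(image_vec, text_vecs)
print(label, probs)
```

Because no class-specific training is involved, adding a new category only requires embedding one more text prompt, which is why the quality and scale of the image-text pre-training data matter so much for this task.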

updated: Mon Feb 14 2022 14:37:15 GMT+0000 (UTC)

published: Mon Feb 14 2022 14:37:15 GMT+0000 (UTC)

arXiv
