Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Jiaxi Gu; Xiaojun Meng; Guansong Lu; Lu Hou; Minzhe Niu; Xiaodan Liang; Lewei Yao; Runhui Huang; Wei Zhang; Xin Jiang; Chunjing Xu; Hang Xu

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success relies heavily on the scale of the cross-modal datasets used for pre-training. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. We also provide extensive experiments and a benchmark of different downstream tasks, including a new human-verified image-text test dataset, the largest of its kind. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, Wukong_ViT-L achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked on downstream tasks with other variants on multiple datasets, e.g., Flickr8K-CN, Flickr-30K-CN, and COCO-CN. More information is available at https://wukong-dataset.github.io/wukong-dataset/.

updated: Thu Sep 29 2022 03:37:02 GMT+0000 (UTC)

published: Mon Feb 14 2022 14:37:15 GMT+0000 (UTC)

arXiv
