Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Chunyu Xie; Jincheng Li; Heng Cai; Fanjing Kong; Xiaoyu Wu; Jianfei Song; Henrique Morimitsu; Lin Yao; Dexin Wang; Dawei Leng; Baochang Zhang; Xiangyang Ji; Yafeng Deng

Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. In contrast to the many benchmarks available for English corpora, large-scale pre-training datasets and downstream datasets for Chinese remain largely unexplored. In this work, we build a large-scale, high-quality Chinese cross-modal benchmark named ZERO for the research community. It contains ZERO-Corpus, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks. ZERO-Corpus contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are currently the largest for Chinese cross-modal downstream tasks. Along with the ZERO benchmark, we develop a VLP framework for large-scale cross-modal learning with a pre-Ranking + Ranking mechanism, boosted by target-guided Distillation and feature-guided Distillation (R2D2). A global contrastive pre-ranking is first introduced to learn the individual representations of images and texts. These primitive representations are then fused in a fine-grained ranking manner via an image-text cross encoder and a text-image cross encoder. Target-guided distillation and feature-guided distillation are further proposed to enhance the capability of R2D2. With ZERO-Corpus and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
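The two-stage idea in the abstract (a global contrastive pre-ranking over separate image and text representations, followed by a finer-grained ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric InfoNCE-style loss, the temperature value, the function names, and the `fine_scorer` callback (a stand-in for the cross encoders) are all assumptions made for illustration, and the distillation components are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; image i is assumed to match text i.

    The temperature is an illustrative default, not a value from the paper.
    """
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(cols))


def pre_rank_then_rank(query, candidates, fine_scorer, k=2):
    """Keep the top-k candidates by cosine similarity, then re-rank them.

    fine_scorer(query, candidate) is a hypothetical stand-in for a more
    expensive cross-encoder score; higher means a better match.
    """
    coarse = sorted(
        range(len(candidates)),
        key=lambda j: cosine(query, candidates[j]),
        reverse=True,
    )[:k]
    return sorted(coarse, key=lambda j: fine_scorer(query, candidates[j]), reverse=True)
```

Under these assumptions, the cheap similarity narrows the candidate set so that the costlier scorer only runs on a few items, which is the usual motivation for a pre-ranking stage.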

updated: Thu Nov 17 2022 10:18:14 GMT+0000 (UTC)

published: Sun May 08 2022 13:19:23 GMT+0000 (UTC)

arXiv
