Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

An Yang; Junshu Pan; Junyang Lin; Rui Men; Yichang Zhang; Jingren Zhou; Chang Zhou

The tremendous success of CLIP (Radford et al., 2021) has promoted research on and applications of contrastive learning for vision-language pretraining. However, most publicly available CLIP models are pretrained on English data, and it is hard to find a CLIP pretrained on Chinese data. We believe that pretraining a Chinese CLIP is essential to research and industry for two reasons. First, it can benefit vision-language retrieval in Chinese and thus promote language-specific multimodal representation learning. Second, the distribution of images on Chinese websites is likely to differ from that of images on English websites. In this work, we construct a large-scale dataset of Chinese image-text pairs, most of which are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on this new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, in which the model is first trained with the image encoder frozen and then trained with all parameters optimized, to improve model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in both zero-shot and finetuning setups, and that it can achieve competitive performance in zero-shot image classification, as evaluated on the ELEVATER benchmark (Li et al., 2022). Through an ablation study, we also show that the two-stage pretraining method is the most effective of the options compared. We release our code at https://github.com/OFA-Sys/Chinese-CLIP.
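As a rough illustration of the two ideas named in the abstract, the sketch below shows a CLIP-style symmetric contrastive loss over a batch of paired embeddings, together with a toy schedule for the two-stage setup. It is not taken from the released code: the function names, the group names `image_encoder` and `text_encoder`, and the temperature of 0.07 are assumptions made for illustration, and the embeddings are assumed to be L2-normalized already.

```python
import math


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair and to be
    L2-normalized. The temperature value is an illustrative assumption.
    """
    n = len(image_embs)
    # logits[i][j]: dot product of image i and text j, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]

    def cross_entropy(rows):
        # The correct class for row i is column i (the matching pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    image_to_text = cross_entropy(logits)
    text_to_image = cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)


def trainable_groups(stage):
    """Hypothetical two-stage schedule: which parameter groups get updated.

    Stage 1 keeps the image encoder frozen; stage 2 optimizes all parameters.
    """
    if stage == 1:
        return {"image_encoder": False, "text_encoder": True}
    if stage == 2:
        return {"image_encoder": True, "text_encoder": True}
    raise ValueError("stage must be 1 or 2")
```

In a real training loop, the flags returned by `trainable_groups` would decide which parameters receive gradient updates in each stage, while the same contrastive loss is minimized throughout.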

updated: Wed Nov 02 2022 17:47:23 GMT+0000 (UTC)

published: Wed Nov 02 2022 17:47:23 GMT+0000 (UTC)

arXiv
