Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Yangguang Li; Feng Liang; Lichen Zhao; Yufeng Cui; Wanli Ouyang; Jing Shao; Fengwei Yu; Junjie Yan

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the data's potential through (1) self-supervision within each modality; (2) multi-view supervision across modalities; and (3) nearest-neighbor supervision from other similar pairs. Benefiting from this intrinsic supervision, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework. Our code, dataset, and models are released at: https://github.com/Sense-GVT/DeCLIP
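The image-text contrastive supervision that the abstract takes as its starting point can be illustrated with a minimal sketch. The code below is not the authors' implementation: it is a pure-Python symmetric InfoNCE-style loss over a toy batch. The function names, the temperature value of 0.07, and the averaging over view pairings in `multi_view_loss` are assumptions made for illustration, and the abstract does not specify how the additional supervision terms are weighted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Pair i of image_embs and text_embs is treated as the positive;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the abstract.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


def multi_view_loss(image_views, text_views, temperature=0.07):
    """Hypothetical multi-view variant: average the contrastive loss over
    every (image view, text view) pairing. Each view is a full batch of
    embeddings. This averaging scheme is an illustrative assumption."""
    losses = [
        contrastive_loss(img_batch, txt_batch, temperature)
        for img_batch in image_views
        for txt_batch in text_views
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched: ", contrastive_loss(images, matched_texts))
    print("shuffled:", contrastive_loss(images, shuffled_texts))
```

With correctly matched pairs the loss is close to zero, while shuffling the texts makes it large. This is the signal that pulls each image toward its own caption. With two augmented views per modality, `multi_view_loss` would average over four pairings instead of one, which is one way to read "multi-view supervision across modalities".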

updated: Mon Oct 11 2021 12:17:32 GMT+0000 (UTC)

published: Mon Oct 11 2021 12:17:32 GMT+0000 (UTC)

arXiv
