Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Yufeng Cui; Lichen Zhao; Feng Liang; Yangguang Li; Jing Shao

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm for learning visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is because researchers do not choose consistent training recipes and even use different data, which hampers fair comparison between methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a comprehensive analysis of three key factors: data, supervision, and model architecture. We report several findings, some intuitive and some counter-intuitive: (1) Data quality has a significant impact on performance. (2) Certain supervision has different effects on Convolutional Networks (ConvNets) and Vision Transformers (ViT). Applying more suitable supervision can effectively improve the performance of CLIP. (3) Curtailing the text encoder reduces the training cost without much affecting the final performance. Moreover, we combine DeCLIP with FILIP, yielding the strongest variant, DeFILIP. The CLIP-benchmark will be released at https://github.com/Sense-GVT/DeCLIP for future CLIP research.

updated: Fri Mar 11 2022 08:41:00 GMT+0000 (UTC)

published: Fri Mar 11 2022 08:41:00 GMT+0000 (UTC)

arXiv
