CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Aman Shrivastava; Ramprasaath R. Selvaraju; Nikhil Naik; Vicente Ordonez

We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly less data and smaller batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains an absolute gain of +14.0% mAP on Pascal VOC classification and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
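
To make the "one negative per positive" idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the specific lower bound, so this sketch assumes a Jensen-Shannon-style mutual information bound with a plain dot-product critic; the function names (`softplus`, `dot`, `one_negative_loss`) and the toy features are hypothetical and chosen only for illustration.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Assumed critic: a plain dot product between image and text features.
    return sum(a * b for a, b in zip(u, v))


def one_negative_loss(image_feats, text_feats, seed=0):
    """Jensen-Shannon-style contrastive loss (an assumption of this sketch).

    Each matched (image, text) pair is contrasted against exactly one
    mismatched text, rather than against every other item in the batch.
    """
    n = len(image_feats)
    if n < 2 or n != len(text_feats):
        raise ValueError("need at least two aligned image/text pairs")
    rng = random.Random(seed)
    total = 0.0
    for i in range(n):
        # Sample a single negative caption index j != i.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        pos = dot(image_feats[i], text_feats[i])
        neg = dot(image_feats[i], text_feats[j])
        # Maximizing -softplus(-pos) - softplus(neg) is the same as
        # minimizing softplus(-pos) + softplus(neg).
        total += softplus(-pos) + softplus(neg)
    return total / n


if __name__ == "__main__":
    images = [[4.0, 0.0], [0.0, 4.0]]
    captions = [[4.0, 0.0], [0.0, 4.0]]          # aligned with images
    swapped = [[0.0, 4.0], [4.0, 0.0]]           # deliberately mismatched
    print(one_negative_loss(images, captions))   # lower: pairs agree
    print(one_negative_loss(images, swapped))    # higher: pairs disagree
```

In this sketch the cost per positive pair is two critic evaluations regardless of batch size, which is the property the abstract contrasts with CLIP's batch-wide negatives. A real system would compute the features with learned image and text encoders and backpropagate through this loss.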

updated: Thu May 11 2023 13:47:42 GMT+0000 (UTC)

published: Tue Dec 14 2021 03:08:37 GMT+0000 (UTC)

arXiv
