CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Peng Gao; Shijie Geng; Renrui Zhang; Teli Ma; Rongyao Fang; Yongfeng Zhang; Hongsheng Li; Yu Qiao

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced by Radford et al. (2021) that directly learns to align images with raw text in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is used to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models other than prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which conducts fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
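The adapter idea described in the abstract (a bottleneck layer followed by residual-style blending with the pre-trained feature) can be illustrated with a minimal sketch. The abstract does not give the exact architecture, so everything concrete here is an assumption made for illustration: the two-layer down/up projection, the ReLU activations, the blending ratio `alpha`, the toy dimensions, and the random stand-in feature vector. No real CLIP model is loaded, and only the Python standard library is used.

```python
import random

def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def adapter(feature, w_down, w_up):
    """Bottleneck adapter (assumed form): project down, then back up to the original size."""
    hidden = relu(linear(feature, w_down))   # dim -> bottleneck
    return relu(linear(hidden, w_up))        # bottleneck -> dim

def blend(feature, adapted, alpha):
    """Residual-style blending: alpha weights the adapted feature,
    (1 - alpha) weights the original pre-trained feature (assumed ratio)."""
    return [alpha * a + (1.0 - alpha) * f for f, a in zip(feature, adapted)]

def init_weights(d_in, d_out, rng):
    """Small random weights; in practice these would be learned from few-shot data."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]

# Toy sizes chosen for illustration only.
rng = random.Random(0)
dim, bottleneck = 8, 2
w_down = init_weights(dim, bottleneck, rng)
w_up = init_weights(bottleneck, dim, rng)

# Stand-in for a frozen pre-trained image or text feature.
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]

adapted = adapter(feature, w_down, w_up)
blended = blend(feature, adapted, alpha=0.2)
```

In this sketch only `w_down` and `w_up` would be trained while the pre-trained feature extractor stays frozen. Setting `alpha` to 0 recovers the original feature unchanged, which is what makes the blending residual in style.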

updated: Sat Oct 09 2021 11:39:30 GMT+0000 (UTC)

published: Sat Oct 09 2021 11:39:30 GMT+0000 (UTC)

arXiv
