Conditional Prompt Learning for Vision-Language Models

Kaiyang Zhou; Jingkang Yang; Chen Change Loy; Ziwei Liu

With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively tuned manual prompts. In our study, we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset, and that it also yields stronger domain generalization performance. Code is available at https://github.com/KaiyangZhou/CoOp.
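The difference between a static and an input-conditional prompt can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). The class name ConditionalPrompt, its method names, the layer sizes, and the two-layer Linear-ReLU-Linear shape of the lightweight network are all assumptions made for illustration, as is the choice to combine the generated token with the context vectors by addition, which the abstract does not specify. Plain Python lists stand in for tensors so the example runs without any dependencies.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b, where W is given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim, scale=0.1):
    """Random weights and zero bias for one linear layer."""
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ConditionalPrompt:
    """Hypothetical sketch: learnable context vectors plus a small
    network that maps an image feature to one conditioning token."""

    def __init__(self, n_ctx, ctx_dim, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Learnable context vectors shared by all images (CoOp-style).
        self.ctx = [[rng.gauss(0.0, 0.02) for _ in range(ctx_dim)]
                    for _ in range(n_ctx)]
        # Lightweight network (assumed shape: Linear -> ReLU -> Linear).
        self.w1, self.b1 = init_linear(rng, feat_dim, hidden_dim)
        self.w2, self.b2 = init_linear(rng, hidden_dim, ctx_dim)

    def meta_token(self, image_feature):
        """Generate the input-conditional token for one image."""
        hidden = relu(linear(image_feature, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

    def static_context(self):
        """Static prompt: the same context for every image."""
        return [list(vec) for vec in self.ctx]

    def dynamic_context(self, image_feature):
        """Dynamic prompt: every context vector is shifted by the
        image-specific token (additive combination is an assumption)."""
        token = self.meta_token(image_feature)
        return [[c + t for c, t in zip(vec, token)] for vec in self.ctx]


if __name__ == "__main__":
    prompt = ConditionalPrompt(n_ctx=4, ctx_dim=8, feat_dim=6, hidden_dim=3)
    feature = [0.3, -0.1, 0.8, 0.0, 0.5, -0.7]  # stand-in image feature
    context = prompt.dynamic_context(feature)
    print(len(context), "context vectors of dimension", len(context[0]))
```

In this sketch only the context vectors and the small network would be trained; the pre-trained image and text encoders that consume the prompt are left out entirely. Because the context depends on the image feature, two different images generally receive different prompts, which is the property the abstract credits for reduced sensitivity to class shift.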

updated: Thu Mar 10 2022 18:59:41 GMT+0000 (UTC)

published: Thu Mar 10 2022 18:59:41 GMT+0000 (UTC)

arXiv
