Learning to Prompt for Vision-Language Models

Kaiyang Zhou; Jingkang Yang; Chen Change Loy; Ziwei Liu

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning the wording, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
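
The prompt structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_text_encoder` is a hypothetical mean-pooling stand-in for a frozen pre-trained text encoder, the vectors are random, and the temperature-scaled softmax over cosine similarities is assumed here as the usual CLIP-style scoring rule. All function and variable names are illustrative. In actual training, only the context vectors would be updated by gradient descent, which this sketch omits.

```python
import math
import random


def toy_text_encoder(token_vectors):
    """Hypothetical stand-in for a frozen text encoder: mean-pools token vectors."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prompts(context, class_tokens, class_specific=False):
    """Prepend context vectors to each class-name token vector.

    Unified context: `context` is one list of M vectors shared by all classes.
    Class-specific context: `context` holds one list of M vectors per class.
    """
    prompts = []
    for i, cls_tok in enumerate(class_tokens):
        ctx = context[i] if class_specific else context
        prompts.append(list(ctx) + [cls_tok])
    return prompts


def predict(image_feature, prompts, temperature=0.01):
    """Class probabilities: softmax over temperature-scaled cosine similarities."""
    weights = [toy_text_encoder(p) for p in prompts]  # synthesized classifier weights
    logits = [cosine(image_feature, w) / temperature for w in weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
dim, n_ctx, n_cls = 8, 4, 3


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


class_tokens = [rand_vec() for _ in range(n_cls)]  # toy class-name embeddings
unified_ctx = [rand_vec() for _ in range(n_ctx)]   # shared by every class
specific_ctx = [[rand_vec() for _ in range(n_ctx)] for _ in range(n_cls)]
image_feature = rand_vec()                         # toy image embedding

unified_prompts = build_prompts(unified_ctx, class_tokens)
specific_prompts = build_prompts(specific_ctx, class_tokens, class_specific=True)
probs_unified = predict(image_feature, unified_prompts)
probs_specific = predict(image_feature, specific_prompts, temperature=0.01)
```

Each prompt is a sequence of `n_ctx` context vectors followed by one class token, so the two variants differ only in whether the context is shared across classes or stored per class.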

updated: Thu Oct 06 2022 11:36:09 GMT+0000 (UTC)

published: Thu Sep 02 2021 17:57:31 GMT+0000 (UTC)

arXiv
