Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Yinghui Xing; Qirui Wu; De Cheng; Shizhou Zhang; Guoqiang Liang; Yanning Zhang

With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in both the image and text encoders of the pre-trained vision-language model for information that benefits downstream tasks. A recently proposed method, Context Optimization (CoOp), introduces a set of learnable vectors as text prompts on the language side. However, tuning the text prompt alone cannot affect the visual features computed by the image encoder, which leads to sub-optimal performance. In this paper, we propose a dual-modality prompt tuning paradigm that learns text prompts and visual prompts for the text and image encoders simultaneously. In addition, to make the visual prompt concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), in which the visual prompt is generated dynamically by performing cross-attention between language descriptions of template prompts and visual class token embeddings. Our method provides a new paradigm for tuning large pre-trained vision-language models, and extensive experimental results on 8 datasets demonstrate its effectiveness. Our code is available in the supplementary materials.
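The cross-attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which side supplies the queries, so the sketch assumes text-side features act as queries and visual class token embeddings act as keys and values. It also omits the learned projection matrices a real attention layer would have, and all names (`cross_attention`, `text_features`, `visual_tokens`) and dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    queries: n_q vectors of dimension d
    keys, values: n_k vectors of dimension d
    Returns n_q vectors, each a weighted average of the value vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one coordinate at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


random.seed(0)
dim = 8  # assumed feature dimension, for illustration only

# Hypothetical stand-ins for encoder outputs (random, not real features).
text_features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

# Assumption: text features query the visual class token embeddings.
class_aware_prompts = cross_attention(text_features, visual_tokens, visual_tokens)
```

Under these assumptions, each of the three output vectors mixes the visual token embeddings according to how strongly they match one text description, which is one way a prompt could be made to depend on the class.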

updated: Wed Aug 17 2022 15:06:36 GMT+0000 (UTC)

published: Wed Aug 17 2022 15:06:36 GMT+0000 (UTC)

arXiv
