Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Yinghui Xing; Qirui Wu; De Cheng; Shizhou Zhang; Guoqiang Liang; Peng Wang; Yanning Zhang

With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pre-trained model for information that benefits a downstream task. A recently proposed method, Context Optimization (CoOp), introduces a set of learnable vectors as the text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier"; the visual features computed by the image encoder are unaffected, which leads to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, we further propose a Class-Aware Visual Prompt Tuning (CAVPT) scheme within DPT. The class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, so that it encodes both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
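
The cross-attention step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes single-head scaled dot-product attention with no learned projections, where each class's text prompt feature acts as a query over the image patch token embeddings. All names (`cross_attention`, `text_prompt_features`, `patch_tokens`) and the toy 4-dimensional vectors are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention.

    Assumption: no learned query/key/value projections, so the same
    vectors serve as keys and values. A real module would add them.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Attention weights of this query over every patch token.
        weights = softmax([dot(q, k) * scale for k in keys_values])
        # Weighted sum of the patch tokens.
        mixed = [
            sum(w * token[i] for w, token in zip(weights, keys_values))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical toy data: two classes, three image patches, dimension 4.
text_prompt_features = [
    [1.0, 0.0, 0.0, 0.0],  # text feature for class 0
    [0.0, 1.0, 0.0, 0.0],  # text feature for class 1
]
patch_tokens = [
    [4.0, 0.0, 1.0, 0.0],  # patch resembling class 0
    [0.0, 4.0, 0.0, 1.0],  # patch resembling class 1
    [0.0, 0.0, 0.0, 0.0],  # background patch
]

# One class-aware visual prompt per class, conditioned on this image.
class_aware_prompts = cross_attention(text_prompt_features, patch_tokens)
```

Because the queries come from the text side and the keys and values from the image, each resulting vector mixes task-related (class) information with instance-specific (patch) information, which is the property the abstract attributes to the class-aware visual prompt.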

updated: Thu Feb 16 2023 12:33:22 GMT+0000 (UTC)

published: Wed Aug 17 2022 15:06:36 GMT+0000 (UTC)

arXiv
