DPL: Decoupled Prompt Learning for Vision-Language Models

Chen Xu; Yuhan Zhu; Guozhen Zhang; Haocheng Shen; Yixuan Liao; Xiaoxin Chen; Gangshan Wu; Limin Wang

Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks. However, current methods tend to overfit to seen categories, which limits their ability to generalize to unseen classes. In this paper, we propose a new method, Decoupled Prompt Learning (DPL), which reformulates the attention in prompt learning to alleviate this problem. Specifically, we theoretically investigate the collaborative process between prompts and instances (i.e., image patches/text tokens) by reformulating the original self-attention into four separate sub-processes. Through detailed analysis, we observe that strengthening certain sub-processes with approximation techniques improves robustness and generalizability. Furthermore, we introduce language-conditioned textual prompting based on decoupled attention to naturally preserve the generalization of text input. Our approach applies to both visual and textual modalities, making it easy to extend to multi-modal prompt learning. By combining the proposed techniques, our approach achieves state-of-the-art performance on three representative benchmarks encompassing 15 image recognition datasets, while remaining parameter-efficient. Moreover, DPL does not rely on any auxiliary regularization task or extra training data, further demonstrating its generalization ability.
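The abstract does not spell out the four sub-processes. One natural reading, assumed here and not confirmed by the text above, is that self-attention over the concatenated sequence of prompts and instances splits into four query-key blocks: prompt-to-prompt, prompt-to-instance, instance-to-prompt, and instance-to-instance. The sketch below only illustrates that block partition of the attention scores in plain Python. It is not the authors' DPL implementation; all names, sizes, and random weights are hypothetical, and the approximation techniques and language-conditioned prompting are not modelled.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_scores(tokens, w_q, w_k):
    """Scaled dot-product scores; rows index queries, columns index keys."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    scale = math.sqrt(len(q[0]))
    return [[s / scale for s in row] for row in matmul(q, transpose(k))]


def split_score_blocks(scores, num_prompts):
    """Partition the score matrix into four query-to-key blocks.

    Assumes the first `num_prompts` tokens are prompts and the rest are
    instances (image patches or text tokens). "a_to_b" means queries
    from group a scored against keys from group b.
    """
    p = num_prompts
    return {
        "prompt_to_prompt": [row[:p] for row in scores[:p]],
        "prompt_to_instance": [row[p:] for row in scores[:p]],
        "instance_to_prompt": [row[:p] for row in scores[p:]],
        "instance_to_instance": [row[p:] for row in scores[p:]],
    }


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes: 2 prompt tokens, 3 instance tokens, width 4.
rng = random.Random(0)
num_prompts, num_instances, dim = 2, 3, 4
prompts = random_matrix(num_prompts, dim, rng)
instances = random_matrix(num_instances, dim, rng)
tokens = prompts + instances  # prompts first, then instances

w_q = random_matrix(dim, dim, rng)
w_k = random_matrix(dim, dim, rng)
w_v = random_matrix(dim, dim, rng)

scores = attention_scores(tokens, w_q, w_k)
blocks = split_score_blocks(scores, num_prompts)

# Standard joint self-attention, shown for comparison with the blocks.
weights = softmax_rows(scores)
output = matmul(weights, matmul(tokens, w_v))
```

Writing the scores as four blocks makes it possible to treat each interaction separately, for example by normalizing or masking one block without touching the others. Which blocks DPL strengthens, and how, is described in the paper rather than in this abstract.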

updated: Sat Aug 19 2023 15:48:38 GMT+0000 (UTC)

published: Sat Aug 19 2023 15:48:38 GMT+0000 (UTC)

arXiv
