Multi-Prompt with Depth Partitioned Cross-Modal Learning

Yingjie Tian; Yiqi Wang; Xianda Guo; Zheng Zhu; Long Chen

In recent years, soft prompt learning methods have been proposed to fine-tune large-scale vision-language pre-trained models for various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input to models whose parameters are frozen. However, they often employ a single prompt to describe class contexts and therefore fail to adequately capture the diverse attributes of a category. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method partitions the depth of the visual encoder and connects learnable prompts to the separated visual depths, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we incorporate prior information from manually designed templates and learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a harmonic mean of 79.28, averaged over 11 diverse image recognition datasets (+7.62 compared to CoOp), which is competitive with state-of-the-art prompting methods.
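The two quantitative ideas in the abstract, partitioning encoder depth among several prompts and reporting a harmonic mean, can be illustrated with a minimal sketch. This is not the authors' implementation. The helper names `partition_depths` and `harmonic_mean`, the contiguous near-equal split, the 12-layer encoder, and the example accuracies are all assumptions made for illustration. The abstract also does not say which two quantities enter the harmonic mean; the sketch assumes base-class and new-class accuracy, as is common in new class generalization benchmarks.

```python
from typing import List


def partition_depths(num_layers: int, num_prompts: int) -> List[List[int]]:
    """Split encoder layer indices into contiguous groups, one per prompt.

    Assumption: a contiguous, near-equal split. The abstract does not
    specify the partitioning scheme actually used by PMPO.
    """
    if not 1 <= num_prompts <= num_layers:
        raise ValueError("need 1 <= num_prompts <= num_layers")
    base, extra = divmod(num_layers, num_prompts)
    groups: List[List[int]] = []
    start = 0
    for i in range(num_prompts):
        # The first `extra` groups take one additional layer each.
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two positive accuracies (hypothetical inputs)."""
    if base_acc <= 0 or new_acc <= 0:
        raise ValueError("accuracies must be positive")
    return 2 * base_acc * new_acc / (base_acc + new_acc)


if __name__ == "__main__":
    # Hypothetical 12-layer visual encoder shared among 4 prompts.
    print(partition_depths(12, 4))
    # Hypothetical accuracies, not results from the paper.
    print(round(harmonic_mean(80.0, 60.0), 2))
```

The harmonic mean penalizes an imbalance between its two inputs more than the arithmetic mean does, so a method cannot score well by excelling on one side only.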

updated: Tue Sep 05 2023 01:56:58 GMT+0000 (UTC)

published: Wed May 10 2023 14:54:29 GMT+0000 (UTC)

arXiv
