Prompt Tuning with Soft Context Sharing for Vision-Language Models

Kun Ding; Ying Wang; Pengzhang Liu; Qiang Yu; Haojian Zhang; Shiming Xiang; Chunhong Pan

Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates that prompt tuning designed for vision-language models can achieve superior performance on few-shot image recognition compared to the linear probe, a strong baseline. In real-world applications, many few-shot tasks are correlated, particularly within a specialized area. However, previous work ignores this information. Inspired by the fact that modeling task relationships through multi-task learning can usually boost performance, we propose a novel method, SoftCPT (Soft Context Sharing for Prompt Tuning), to fine-tune pre-trained vision-language models on multiple target few-shot tasks simultaneously. Specifically, we design a task-shared meta network that generates a prompt vector for each task, using the pre-defined task name together with a learnable meta prompt as input. As such, the prompt vectors of all tasks are shared in a soft manner. The parameters of this shared meta network, as well as the meta prompt vector, are tuned on the joint training set of all target tasks. Extensive experiments on three multi-task few-shot datasets show that SoftCPT outperforms the representative single-task prompt tuning method CoOp [78] by a large margin, implying the effectiveness of multi-task learning in vision-language prompt tuning. The source code and data will be made publicly available.
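The soft sharing idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the meta network's architecture, so a single shared linear layer is assumed here, and `embed_task_name` is a hypothetical stand-in for the text feature of a task name that a frozen vision-language text encoder would provide. The class name, dimensions, and task names are likewise illustrative, and training is omitted.

```python
import random


def embed_task_name(name, dim):
    # Hypothetical stand-in for a frozen text encoder: a deterministic
    # pseudo-random feature vector seeded by the task name.
    rng = random.Random(name)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


class SoftSharedPromptGenerator:
    """Toy task-shared meta network (assumed form: one linear layer).

    Input:  task-name feature concatenated with a learnable meta prompt.
    Output: ctx_len prompt context vectors of size ctx_dim for that task.
    """

    def __init__(self, feat_dim, meta_dim, ctx_len, ctx_dim, seed=0):
        rng = random.Random(seed)
        self.ctx_len = ctx_len
        self.ctx_dim = ctx_dim
        # Meta prompt: shared by all tasks, would be tuned jointly in training.
        self.meta_prompt = [rng.gauss(0.0, 0.02) for _ in range(meta_dim)]
        # Shared weights: one row per output element of the prompt context.
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(feat_dim + meta_dim)]
            for _ in range(ctx_len * ctx_dim)
        ]

    def generate(self, task_feature):
        x = list(task_feature) + self.meta_prompt
        flat = [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
        # Reshape the flat output into ctx_len vectors of length ctx_dim.
        return [
            flat[i * self.ctx_dim:(i + 1) * self.ctx_dim]
            for i in range(self.ctx_len)
        ]


generator = SoftSharedPromptGenerator(feat_dim=8, meta_dim=4, ctx_len=3, ctx_dim=5)
prompt_a = generator.generate(embed_task_name("flower classification", 8))
prompt_b = generator.generate(embed_task_name("bird classification", 8))
```

Both tasks obtain their prompt vectors from the same weights and the same meta prompt, so a gradient from either task would update parameters that shape the other task's prompt. This is the sense in which the sharing is "soft": prompts differ per task but are not learned independently.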

updated: Mon Aug 29 2022 10:19:10 GMT+0000 (UTC)

published: Mon Aug 29 2022 10:19:10 GMT+0000 (UTC)

arXiv

