Multitask Vision-Language Prompt Tuning

Sheng Shen; Shijia Yang; Tianjun Zhang; Bohan Zhai; Joseph E. Gonzalez; Kurt Keutzer; Trevor Darrell


Prompt tuning, which conditions a model on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually learn the prompt vectors for each task independently and from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show that many target tasks can benefit one another by sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and unified vision-language prompt tuning. Results on 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting a new state of the art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. The study shows that the most performant MVLPT for each prompt tuning method prefers different task combinations, and that many tasks can benefit one another, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
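To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a stand-in quadratic loss in place of a frozen vision-language model, and all names (`task_loss_and_grad`, `tune_prompt`, `make_task`) and hyperparameters are hypothetical. It illustrates (i) learning one shared prompt jointly on several source tasks and (ii) using it to initialize a target-task prompt before a few adaptation steps.

```python
import random


def task_loss_and_grad(prompt, task_optimum):
    """Toy stand-in for a frozen model's task loss (an assumption):
    squared distance between the prompt and a per-task optimum."""
    diff = [p - t for p, t in zip(prompt, task_optimum)]
    loss = sum(d * d for d in diff)
    grad = [2.0 * d for d in diff]
    return loss, grad


def tune_prompt(init, task_optima, steps, lr=0.1):
    """Gradient descent on the prompt vector only.
    With several tasks, gradients are averaged (joint multitask tuning)."""
    prompt = list(init)
    n_tasks = len(task_optima)
    for _ in range(steps):
        avg_grad = [0.0] * len(prompt)
        for optimum in task_optima:
            _, grad = task_loss_and_grad(prompt, optimum)
            avg_grad = [a + g / n_tasks for a, g in zip(avg_grad, grad)]
        prompt = [p - lr * g for p, g in zip(prompt, avg_grad)]
    return prompt


random.seed(0)
dim = 8
# Hypothetical shared structure: every task optimum is a small
# perturbation of the same base vector.
base = [random.gauss(0.0, 1.0) for _ in range(dim)]


def make_task():
    return [b + random.gauss(0.0, 0.1) for b in base]


source_tasks = [make_task() for _ in range(5)]
target_task = make_task()
zero_init = [0.0] * dim

# (i) Learn one transferable prompt jointly on all source tasks.
shared_prompt = tune_prompt(zero_init, source_tasks, steps=100)

# (ii) Adapt to the target task with only a few steps, from each init.
few_steps = 3
from_scratch = tune_prompt(zero_init, [target_task], steps=few_steps)
from_shared = tune_prompt(shared_prompt, [target_task], steps=few_steps)

loss_scratch, _ = task_loss_and_grad(from_scratch, target_task)
loss_shared, _ = task_loss_and_grad(from_shared, target_task)
print("target loss, zero init:  ", loss_scratch)
print("target loss, shared init:", loss_shared)
```

In this toy setting the shared initialization helps only because the tasks were constructed to be similar; the abstract's transferability study suggests that in practice the benefit depends on which tasks are combined.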

updated: Tue Nov 22 2022 07:24:16 GMT+0000 (UTC)

published: Mon Nov 21 2022 18:41:44 GMT+0000 (UTC)

arXiv

