Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

Juncheng Li; Minghe Gao; Longhui Wei; Siliang Tang; Wenqiao Zhang; Mengze Li; Wei Ji; Qi Tian; Tat-Seng Chua; Yueting Zhuang

Prompt tuning, a recently emerging paradigm, enables powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way by learning "soft prompts" that condition frozen pre-training models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning performance is sensitive to initialization and finding a good initialization is time-consuming, which restricts the fast adaptation ability of the pre-training models. In addition, prompt tuning can undermine the generalizability of the pre-training models, because the learnable prompt tokens easily overfit to the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that, within a meta-learning paradigm using only unlabeled image-text pre-training data, jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability. Rather than designing a specific prompt tuning method, GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that it brings consistent improvement to them in several settings (e.g., few-shot learning, cross-domain generalization, and cross-dataset generalization) over 11 datasets. Further, experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually enhancing way, offering better generalizability than uni-modal prompt tuning methods.

updated: Sun Mar 12 2023 05:03:37 GMT+0000 (UTC)

published: Sun Mar 12 2023 05:03:37 GMT+0000 (UTC)

arXiv
