Understanding and Improving Visual Prompting: A Label-Mapping Perspective

Aochuan Chen; Yuguang Yao; Pin-Yu Chen; Yihua Zhang; Sijia Liu

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in the form of input perturbation patterns) into downstream data points. Yet it remains unclear why VP stays effective even under a ruleless label mapping (LM) between the source classes and the target classes. Motivated by this, we ask: How is LM interrelated with VP? And how can this relationship be exploited to improve VP's accuracy on target tasks? We examine the influence of LM on VP and provide an affirmative answer: a better-quality LM (assessed by mapping precision and explanation) consistently improves the effectiveness of VP. This contrasts with prior art, where the LM factor was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. We show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets, respectively. In addition, our proposal for CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD, respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
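To make the label-mapping idea concrete, here is a minimal sketch of one way a one-to-one mapping from target classes to source classes could be built. It is not taken from the abstract or from the authors' code: the greedy, count-based rule, the function names `greedy_label_mapping` and `remap_predictions`, and the toy count matrix are all assumptions made for illustration. The input is assumed to be a table of how often the frozen source model predicts each source class on prompted samples of each target class. In an iterative scheme such as the one the abstract describes, a mapping like this would presumably be recomputed as the prompt is updated.

```python
from typing import Dict, List


def greedy_label_mapping(counts: List[List[int]]) -> Dict[int, int]:
    """Map each target class to a distinct source class (illustrative rule).

    counts[t][s] is assumed to hold the number of prompted samples of
    target class t that the frozen source model predicts as source class s.
    """
    num_targets = len(counts)
    num_sources = len(counts[0])
    if num_targets > num_sources:
        raise ValueError("need at least as many source classes as target classes")

    # Rank every (target, source) pair by descending count; ties break on index.
    pairs = sorted(
        ((t, s) for t in range(num_targets) for s in range(num_sources)),
        key=lambda p: (-counts[p[0]][p[1]], p[0], p[1]),
    )

    mapping: Dict[int, int] = {}
    used_sources = set()
    for t, s in pairs:
        # Skip pairs whose target is already mapped or whose source is taken.
        if t in mapping or s in used_sources:
            continue
        mapping[t] = s
        used_sources.add(s)
        if len(mapping) == num_targets:
            break
    return mapping


def remap_predictions(source_preds: List[int], mapping: Dict[int, int]) -> List[int]:
    """Translate source-class predictions into target classes (-1 if unmapped)."""
    inverse = {s: t for t, s in mapping.items()}
    return [inverse.get(s, -1) for s in source_preds]


# Toy example (made-up numbers): 2 target classes, 4 source classes.
toy_counts = [
    [1, 8, 0, 1],  # target 0 is mostly predicted as source 1
    [0, 6, 3, 1],  # target 1 also favors source 1, which is already taken
]
toy_mapping = greedy_label_mapping(toy_counts)
toy_targets = remap_predictions([1, 2, 3], toy_mapping)
```

In the toy example, target class 1 falls back to its second most frequent source class because source class 1 is already claimed by target class 0. This shows why the mapping can matter: a fixed or arbitrary assignment would ignore these prediction statistics entirely.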

updated: Fri Mar 10 2023 22:54:45 GMT+0000 (UTC)

published: Mon Nov 21 2022 16:49:47 GMT+0000 (UTC)

arXiv
