PVP: Pre-trained Visual Parameter-Efficient Tuning

Zhao Song; Ke Yang; Naiyang Guan; Junjie Zhu; Peng Qiao; Qingyong Hu



Large-scale pre-trained transformers have demonstrated remarkable success in various computer vision tasks. However, fully fine-tuning these models for downstream tasks remains highly challenging due to their high computational and storage costs. Recently, Parameter-Efficient Tuning (PETuning) techniques, e.g., Visual Prompt Tuning (VPT) and Low-Rank Adaptation (LoRA), have significantly reduced computation and storage costs by inserting lightweight prompt modules into the pre-trained models and tuning only these modules, which contain a small number of trainable parameters, while keeping the transformer backbone frozen. Although only a few parameters need to be adjusted, most PETuning methods still require a significant amount of downstream training data to achieve good results. Their performance is inadequate in low-data regimes, especially when there are only one or two examples per class. To address this, we first empirically identify that the poor performance is mainly due to the inappropriate way of initializing prompt modules, a finding that has also been verified for pre-trained language models. Next, we propose a Pre-trained Visual Parameter-efficient (PVP) Tuning framework, which first pre-trains the parameter-efficient tuning modules and then leverages these pre-trained modules, along with the pre-trained transformer backbone, to perform parameter-efficient tuning on downstream tasks. Experimental results on five Fine-Grained Visual Classification (FGVC) datasets and the VTAB-1k benchmark demonstrate that our proposed method significantly outperforms state-of-the-art PETuning methods.
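To make the abstract's terms concrete, the following is a minimal NumPy sketch, not the authors' implementation, of the two module types it names and of the initialization choice PVP targets. `LoRALinear` wraps a frozen weight with a trainable low-rank update, `prepend_prompts` builds a VPT-style input by concatenating learnable prompt tokens before the image tokens, and `init_prompts` contrasts random initialization with copying pre-trained prompts. All names are hypothetical, the random-initialization scales are assumptions, and the upstream stage that would actually pre-train the modules is not shown; the `pretrained` arguments simply stand in for its output.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update scale * B @ A (hypothetical sketch)."""

    def __init__(self, weight, rank, alpha=1.0, pretrained=None, rng=None):
        d_out, d_in = weight.shape
        self.weight = weight          # frozen pre-trained backbone weight, never updated
        self.scale = alpha / rank
        if pretrained is not None:
            # PVP-style: start from LoRA factors obtained in an (assumed) pre-training stage.
            A, B = pretrained
            assert A.shape == (rank, d_in) and B.shape == (d_out, rank)
            self.A, self.B = A.copy(), B.copy()
        else:
            # Common LoRA init: A is small Gaussian, B is zero, so the adapted
            # layer initially computes exactly the same output as the frozen one.
            rng = np.random.default_rng(0) if rng is None else rng
            self.A = rng.normal(scale=0.01, size=(rank, d_in))
            self.B = np.zeros((d_out, rank))

    def __call__(self, x):
        # x: (n_tokens, d_in) -> (n_tokens, d_out)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        # Only A and B would be trained; the backbone weight stays frozen.
        return self.A.size + self.B.size


def prepend_prompts(tokens, prompts):
    """VPT-style input: p prompt tokens followed by n image tokens -> (p + n, dim)."""
    return np.concatenate([prompts, tokens], axis=0)


def init_prompts(num_prompts, dim, pretrained=None, rng=None):
    """Random prompt init (scale is an assumption) versus copying pre-trained prompts."""
    if pretrained is not None:
        assert pretrained.shape == (num_prompts, dim)
        return pretrained.copy()
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.uniform(-0.1, 0.1, size=(num_prompts, dim))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    d = 8
    W = rng.normal(size=(d, d))      # stands in for one frozen ViT projection
    x = rng.normal(size=(4, d))      # 4 image tokens

    layer = LoRALinear(W, rank=2)
    print(np.allclose(layer(x), x @ W.T))      # True, because B starts at zero
    print(layer.trainable_parameters())        # 32 = 2*8 + 8*2

    prompts = init_prompts(3, d)               # under PVP: init_prompts(3, d, pretrained=...)
    print(prepend_prompts(x, prompts).shape)   # (7, 8)
```

The sketch only illustrates where initialization enters: in both module types the downstream tuning starts from whatever `A`, `B`, or prompt values are supplied, and PVP's claim is that starting from pre-trained values rather than random ones is what matters in the few-shot setting.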

updated: Wed Apr 26 2023 15:55:29 GMT+0000 (UTC)

published: Wed Apr 26 2023 15:55:29 GMT+0000 (UTC)

arXiv


