Towards a Unified View on Visual Parameter-Efficient Transfer Learning

Bruce X. B. Yu; Jianlong Chang; Lingbo Liu; Qi Tian; Chang Wen Chen

Parameter-efficient transfer learning (PETL) aims to make good use of the representation knowledge in large pre-trained models by fine-tuning only a small number of parameters. Recently, taking inspiration from the natural language processing (NLP) domain, popular PETL techniques such as prompt-tuning and Adapter have been successfully applied to the vision domain. However, prefix-tuning remains under-explored for vision tasks. In this work, we aim to adapt large vision models (LVMs) to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL, called visual-PETL (V-PETL), to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms, while implementing various PETL techniques, in particular the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and vision data, we propose a new variation of the prefix-tuning module, called parallel attention (PATT), for vision downstream tasks. An extensive empirical analysis on vision tasks with different frozen LVMs has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters, and outperforms full-tuning with far fewer parameters. Code and data are available at: https://github.com/bruceyo/V-PETL.

updated: Thu Mar 02 2023 03:00:36 GMT+0000 (UTC)

published: Mon Oct 03 2022 09:54:39 GMT+0000 (UTC)

arXiv
