Convolutional Bypasses Are Better Vision Transformer Adapters

Shibo Jie; Zhi-Hong Deng

The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as Vision Transformers (ViTs) grow exponentially in size, full finetuning becomes prohibitive because of the heavier storage overhead. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies attempt to insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) into a pretrained ViT and finetune only these modules while the pretrained weights stay frozen. However, these modules were originally proposed for finetuning language models and do not take into account prior knowledge specific to visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small number of trainable parameters (less than 0.5% of the model parameters) to adapt the large ViT. Unlike other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and is thus more suitable for visual tasks, especially in the low-data regime. Experimental results on the VTAB-1K benchmark and few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity of tailoring vision-oriented adaptation modules for adapting vision models.
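
To make the "less than 0.5% of model parameters" figure concrete, the following is a rough back-of-the-envelope sketch in plain Python. The abstract does not specify the bypass layout, so every number here is an assumption chosen for illustration: a ViT-B-sized backbone of roughly 86 million parameters with 12 layers and width 768, two bypasses per layer, and each bypass built as a 1x1 down-projection, a 3x3 convolution over a bottleneck of 8 channels, and a 1x1 up-projection. The helper names are hypothetical.

```python
# Hypothetical configuration; none of these numbers come from the abstract.
EMBED_DIM = 768                 # token width of a ViT-B-sized backbone (assumed)
NUM_LAYERS = 12                 # number of transformer blocks (assumed)
BYPASSES_PER_LAYER = 2          # e.g. one beside attention, one beside the MLP (assumed)
BOTTLENECK = 8                  # channels inside each bypass (assumed)
KERNEL = 3                      # spatial kernel size of the middle convolution (assumed)
BACKBONE_PARAMS = 86_000_000    # approximate size of the frozen backbone (assumed)


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def bypass_params(dim, bottleneck, kernel):
    """Parameters of one bottleneck bypass: 1x1 down, kxk conv, 1x1 up."""
    down = conv_params(dim, bottleneck, 1)
    mid = conv_params(bottleneck, bottleneck, kernel)
    up = conv_params(bottleneck, dim, 1)
    return down + mid + up


# Only the bypasses are trained; the backbone weights stay frozen.
trainable = NUM_LAYERS * BYPASSES_PER_LAYER * bypass_params(EMBED_DIM, BOTTLENECK, KERNEL)
fraction = trainable / BACKBONE_PARAMS

print(f"trainable parameters: {trainable:,}")
print(f"fraction of backbone: {fraction:.2%}")
```

Under these assumed settings the bypasses add about 0.33 million trainable parameters, roughly 0.38% of the backbone, which is consistent with the bound stated in the abstract. A different bottleneck width or placement would change the total.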

updated: Tue Aug 09 2022 10:40:06 GMT+0000 (UTC)

published: Thu Jul 14 2022 16:32:28 GMT+0000 (UTC)

arXiv
