VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Zi-Yuan Hu; Yanyang Li; Michael R. Lyu; Liwei Wang

As pre-trained language models (PLMs) grow rapidly in size, full fine-tuning becomes prohibitively expensive in both training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques have been proposed that integrate modular modifications (e.g., Adapter and LoRA) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglect of the functionality gap between encoders and decoders can degrade performance, and existing PET techniques (e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework that imposes effective control over modular modifications via a novel granularity-controlled mechanism. From the different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better trade-offs between efficiency and effectiveness. We further propose lightweight PET module designs that enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness, and transferability of our VL-PET framework. In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate that applying our VL-PET designs to existing PET techniques yields significant performance improvements. Our code is available at https://github.com/HenryHZY/VL-PET.
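
The abstract does not spell out the granularity-controlled mechanism, so the following is only an illustrative sketch, not the authors' formulation. It assumes that a modular modification is a standard bottleneck adapter and that "control" means scaling the adapter output elementwise by a matrix `g` before the residual addition. The function names, the four granularity labels, and the constant value 0.5 are all hypothetical; in a real system `g` would be learned or computed, and the actual implementation is in the linked repository.

```python
import numpy as np


def adapter_delta(h, w_down, w_up):
    """Bottleneck adapter update: ReLU(h @ w_down) @ w_up."""
    return np.maximum(h @ w_down, 0.0) @ w_up


def controlled_update(h, w_down, w_up, g):
    """Residual update with the adapter output scaled elementwise by g.

    g is an assumed control matrix; NumPy broadcasting lets it be a
    scalar, a per-token column, a per-feature row, or a full matrix.
    """
    return h + g * adapter_delta(h, w_down, w_up)


rng = np.random.default_rng(0)
n_tokens, d_model, d_bottleneck = 4, 8, 2

h = rng.standard_normal((n_tokens, d_model))               # frozen PLM hidden states
w_down = rng.standard_normal((d_model, d_bottleneck)) * 0.1  # trainable down-projection
w_up = rng.standard_normal((d_bottleneck, d_model)) * 0.1    # trainable up-projection

# Hypothetical control matrices at four granularities (constant 0.5 here).
controls = {
    "scalar": np.array(0.5),
    "per_token": np.full((n_tokens, 1), 0.5),
    "per_feature": np.full((1, d_model), 0.5),
    "full": np.full((n_tokens, d_model), 0.5),
}

outputs = {name: controlled_update(h, w_down, w_up, g) for name, g in controls.items()}

# Only the two projections are tuned; the backbone stays frozen.
trainable_params = w_down.size + w_up.size
```

With `g` set to zero the hidden states pass through unchanged, and with `g` set to one the sketch reduces to an ordinary residual adapter, which is one way to picture how a control matrix could limit "excessive" modification.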

updated: Fri Aug 18 2023 20:18:30 GMT+0000 (UTC)

published: Fri Aug 18 2023 20:18:30 GMT+0000 (UTC)

arXiv
