Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Zaid Khan; Yun Fu

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and a language model through contrastive training. Can such models instead be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training, and updating specific components (<1% of parameters) can match 75% of full-model training performance. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient training scales with model and dataset size. Where paired image-text data is scarce but strong multilingual language models exist (e.g. low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.
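As a rough illustration of the two ideas in the abstract, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings, and the fraction of parameters left trainable when only some components are updated. It is a minimal sketch and not the authors' implementation: the abstract does not specify the loss, so the symmetric cross-entropy form and the temperature value of 0.07 are assumptions, and the function names and the component names in the usage note are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Assumption: image_embs[k] and text_embs[k] form the k-th matched pair,
    and every other combination in the batch is treated as a negative.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same check on the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


def trainable_fraction(param_counts, trainable_names):
    """Share of all parameters that receive updates (hypothetical helper)."""
    total = sum(param_counts.values())
    trainable = sum(
        count for name, count in param_counts.items() if name in trainable_names
    )
    return trainable / total
```

For example, `trainable_fraction({"frozen_backbone": 99, "updated_component": 1}, {"updated_component"})` gives 0.01, the scale of the "<1% of parameters" setting mentioned in the abstract. Which components the authors actually update is not stated in the abstract, so the names here are placeholders.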

updated: Tue Mar 21 2023 14:12:08 GMT+0000 (UTC)

published: Tue Mar 21 2023 14:12:08 GMT+0000 (UTC)

arXiv
