LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao; Jiaming Han; Renrui Zhang; Ziyi Lin; Shijie Geng; Aojun Zhou; Wei Zhang; Pan Lu; Conghui He; Xiangyu Yue; Hongsheng Li; Yu Qiao

Efficiently transforming large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias, and scale), which distributes the instruction-following ability across the entire LLaMA model rather than confining it to the adapters. Secondly, we propose an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better incorporation of visual knowledge. Thirdly, we introduce a joint training paradigm over image-text pairs and instruction-following data that optimizes disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can follow open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
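As a rough illustration of two ideas in the abstract, the sketch below shows (a) a frozen linear layer wrapped with a learnable bias and scale, and (b) a training step that updates only the parameter group assigned to the current kind of data. It is not the authors' implementation. The class `TunedLinear`, the helper `sgd_step`, the form `scale * (W x + bias)`, the identity initialization, and the assignment of parameter names to groups in `PARAM_GROUPS` are all hypothetical choices made for illustration; the abstract states only that the two groups are disjoint.

```python
from dataclasses import dataclass


@dataclass
class TunedLinear:
    """Frozen weight matrix plus a learnable per-output bias and scale.

    Assumption: output = scale * (W @ x + bias), with bias = 0 and
    scale = 1 at the start so the frozen layer's output is unchanged.
    """
    weight: list          # frozen; one row per output feature
    bias: list = None     # learnable
    scale: list = None    # learnable

    def __post_init__(self):
        n_out = len(self.weight)
        if self.bias is None:
            self.bias = [0.0] * n_out
        if self.scale is None:
            self.scale = [1.0] * n_out

    def forward(self, x):
        out = []
        for row, b, s in zip(self.weight, self.bias, self.scale):
            pre = sum(w * xi for w, xi in zip(row, x))
            out.append(s * (pre + b))
        return out


# Hypothetical grouping: which parameters each data source may update.
# The only property taken from the abstract is that the groups are disjoint.
PARAM_GROUPS = {
    "image_text": {"visual_projection", "early_gate"},
    "instruction": {"late_prompt", "norm", "bias", "scale"},
}


def sgd_step(params, grads, batch_kind, lr=0.1):
    """Apply one SGD update, touching only the group for this batch kind."""
    active = PARAM_GROUPS[batch_kind]
    return {
        name: (value - lr * grads[name]) if name in active else value
        for name, value in params.items()
    }


if __name__ == "__main__":
    layer = TunedLinear(weight=[[1.0, 2.0], [0.0, 1.0]])
    print(layer.forward([3.0, 4.0]))

    names = PARAM_GROUPS["image_text"] | PARAM_GROUPS["instruction"]
    params = {name: 1.0 for name in names}
    grads = {name: 1.0 for name in names}
    print(sgd_step(params, grads, "image_text"))
```

In this toy setup, an image-text batch changes only `visual_projection` and `early_gate`, while an instruction batch changes only the remaining entries, so neither data source overwrites what the other has learned.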

updated: Fri Apr 28 2023 17:59:25 GMT+0000 (UTC)

published: Fri Apr 28 2023 17:59:25 GMT+0000 (UTC)

arXiv
