Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Gen Luo; Yiyi Zhou; Tianhe Ren; Shengxin Chen; Xiaoshuai Sun; Rongrong Ji

Interest has recently grown in extending the multimodal capabilities of large language models (LLMs), e.g., through vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive: they not only need to optimize an excessive number of parameters, but also require another large-scale pre-training stage before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. MMA is also equipped with a routing algorithm that helps LLMs shift automatically between single- and multi-modal instructions without compromising their natural language understanding ability. We apply MMA to a recent LLM called LLaMA and term the resulting large vision-language instructed model LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN compared with existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual cost of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, which strongly supports the effectiveness of MMA. Our project is released at https://luogen1996.github.io/lavin.
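The abstract names two ingredients, lightweight adapters and a routing algorithm, without giving their form. The sketch below is one minimal way such a layer could look, not the authors' implementation: two bottleneck adapters whose outputs are mixed by softmax routing weights computed from the mean-pooled input, then added back to the frozen features. All names (`BottleneckAdapter`, `MixtureOfModalityAdapter`, `route`), the ReLU bottleneck, the mean pooling, and the temperature parameter are assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z, temperature=1.0):
    """Numerically stable softmax with an (assumed) temperature."""
    m = max(z)
    e = [math.exp((v - m) / temperature) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trainable weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Down-project, apply ReLU, up-project (hypothetical adapter form)."""

    def __init__(self, dim, hidden, rng):
        self.down = make_matrix(hidden, dim, rng)
        self.up = make_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


class MixtureOfModalityAdapter:
    """Two adapter paths mixed by input-dependent routing weights."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.paths = [BottleneckAdapter(dim, hidden, rng) for _ in range(2)]
        self.router = make_matrix(2, dim, rng)  # one logit per path

    def route(self, tokens):
        # Assumption: routing weights come from the mean-pooled token features.
        dim = len(tokens[0])
        pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
        return softmax(matvec(self.router, pooled))

    def __call__(self, tokens):
        w = self.route(tokens)
        out = []
        for t in tokens:
            a, b = self.paths[0](t), self.paths[1](t)
            # Residual connection: the frozen features pass through unchanged
            # and only the small adapter paths would be trained.
            out.append([ti + w[0] * ai + w[1] * bi for ti, ai, bi in zip(t, a, b)])
        return out


layer = MixtureOfModalityAdapter(dim=8, hidden=2)
tokens = [[0.1 * i + j for i in range(8)] for j in range(3)]
weights = layer.route(tokens)
output = layer(tokens)
```

In this sketch only the adapter and router matrices would be trainable, which is the general reason adapter-based tuning keeps the trainable parameter count small; the specific figures quoted in the abstract depend on the authors' actual design.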

updated: Tue Oct 24 2023 09:34:02 GMT+0000 (UTC)

published: Wed May 24 2023 11:06:15 GMT+0000 (UTC)

arXiv
