mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye; Haiyang Xu; Guohai Xu; Jiabo Ye; Ming Yan; Yiyang Zhou; Junyang Wang; Anwen Hu; Pengcheng Shi; Yaya Shi; Chenliang Li; Yuanhong Xu; Hehong Chen; Junfeng Tian; Qi Qian; Ji Zhang; Fei Huang; Jingren Zhou

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, and recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl is a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining, and even improving, the LLM's generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with the LLM frozen, to align image and text. In the second stage, the visual knowledge module is frozen, and language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM together with the abstractor module. We carefully build OwlEval, a visually related instruction evaluation set. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
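The two-stage schedule described in the abstract can be summarized as a small table of which modules are updated in each stage. The following is only a minimal sketch in plain Python: the module labels, stage names, and data labels are hypothetical names chosen for illustration and are not identifiers from the mPLUG-Owl codebase, and the stage-one data label in particular is an assumption, since the abstract only says that stage one aligns image and text.

```python
from dataclasses import dataclass

# Illustrative labels for the components named in the abstract.
# These are NOT identifiers from the mPLUG-Owl repository.
MODULES = ("visual_knowledge", "visual_abstractor", "llm", "llm_lora")


@dataclass(frozen=True)
class Stage:
    name: str             # hypothetical stage label
    trainable: frozenset  # modules whose parameters are updated
    data: tuple           # kinds of training data (labels are assumptions)


STAGES = (
    # Stage 1: the LLM is frozen; the visual knowledge module and the
    # abstractor are trained to align image and text. No LoRA module is
    # trained here, so "llm_lora" simply counts as not trainable.
    Stage(
        name="stage1_alignment",
        trainable=frozenset({"visual_knowledge", "visual_abstractor"}),
        data=("image_text",),  # assumed label; the abstract names no dataset
    ),
    # Stage 2: the visual knowledge module is frozen; a LoRA module on the
    # LLM and the abstractor are fine-tuned jointly.
    Stage(
        name="stage2_instruction_tuning",
        trainable=frozenset({"visual_abstractor", "llm_lora"}),
        data=("language_only_supervised", "multimodal_supervised"),
    ),
)


def frozen_modules(stage):
    """Return the sorted list of modules that are not updated in this stage."""
    return sorted(set(MODULES) - stage.trainable)


def describe(stage):
    """Map every module label to True if it is trained in this stage."""
    return {module: module in stage.trainable for module in MODULES}


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "trains", sorted(stage.trainable),
              "and freezes", frozen_modules(stage))
```

In an actual training framework, a table like this would typically drive which parameters receive gradient updates in each stage; how the released code does this is not stated in the abstract.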

updated: Fri Mar 29 2024 08:13:38 GMT+0000 (UTC)

published: Thu Apr 27 2023 13:27:01 GMT+0000 (UTC)

arXiv
