BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Yang Zhao; Zhijie Lin; Daquan Zhou; Zilong Huang; Jiashi Feng; Bingyi Kang

LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend these abilities by incorporating multi-modal inputs, including image, video, and speech. Although they are effective at generating a precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the input, and thus construct only a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities would not only improve the user experience but also help expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio, and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when it generates a response or description for that object. Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image; 2) a two-stage training scheme and instruction dataset that endow the model with joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model, and dataset are available at https://bubo-gpt.github.io.

updated: Mon Jul 17 2023 15:51:47 GMT+0000 (UTC)

published: Mon Jul 17 2023 15:51:47 GMT+0000 (UTC)

arXiv
