Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen; Zhao Zhang; Weili Zeng; Richong Zhang; Feng Zhu; Rui Zhao

In human conversations, individuals can indicate relevant regions within a scene while addressing others. The other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, with no need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks such as REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarity of user-pointed regions. Our code and model are available at https://github.com/shikras/shikra.
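To make the idea of "spatial coordinates in natural language" concrete, the following is a minimal sketch of how a bounding box could be written into, and read back out of, ordinary text without any extra vocabulary. The abstract does not specify the textual format, so the bracketed `[x0,y0,x1,y1]` layout, the normalization to the range 0 to 1, the three-decimal precision, and the helper names `box_to_text` and `text_to_boxes` are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Assumed textual format: "[x0,y0,x1,y1]" with coordinates normalized to [0, 1].
_BOX_RE = re.compile(
    r"\[\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*\]"
)


def box_to_text(box, width, height, ndigits=3):
    """Format a pixel-space box (x0, y0, x1, y1) as normalized plain text."""
    x0, y0, x1, y1 = box
    values = (x0 / width, y0 / height, x1 / width, y1 / height)
    return "[" + ",".join(f"{v:.{ndigits}f}" for v in values) + "]"


def text_to_boxes(text, width, height):
    """Extract every bracketed box from text and convert it back to pixels."""
    boxes = []
    for match in _BOX_RE.finditer(text):
        nx0, ny0, nx1, ny1 = (float(g) for g in match.groups())
        boxes.append((nx0 * width, ny0 * height, nx1 * width, ny1 * height))
    return boxes


# Hypothetical usage: embed a region in a question, then parse it back.
question = f"What is the animal in {box_to_text((50, 100, 200, 300), 400, 400)} doing?"
regions = text_to_boxes(question, 400, 400)
```

Because the coordinates are plain digits and punctuation, a standard tokenizer can consume and emit them as text, which is the property the abstract emphasizes.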

updated: Tue Jun 27 2023 04:31:52 GMT+0000 (UTC)

published: Tue Jun 27 2023 04:31:52 GMT+0000 (UTC)

arXiv

