LISA: Reasoning Segmentation via Large Language Model

Xin Lai; Zhuotao Tian; Yukang Chen; Yanwei Li; Yuhui Yuan; Shu Liu; Jiaya Jia

Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason and comprehend implicit user intentions. In this work, we propose a new segmentation task: reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving: 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement. Experiments show that our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are at https://github.com/dvlab-research/LISA.
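
The embedding-as-mask idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the added vocabulary token is written `<SEG>` here, and the function names, the hand-picked hidden states, the per-pixel features, and the plain dot-product "decoder" are all hypothetical stand-ins chosen so the example runs without any model. It only shows the flow: extend the vocabulary, locate the segmentation token in the generated output, and decode that token's embedding into a binary mask.

```python
# Toy sketch of the embedding-as-mask flow; every name and number is hypothetical.
SEG_TOKEN = "<SEG>"


def extend_vocab(vocab, token=SEG_TOKEN):
    """Return a copy of vocab with the new token appended if it is missing."""
    return list(vocab) if token in vocab else list(vocab) + [token]


def seg_embedding(output_tokens, hidden_states, token=SEG_TOKEN):
    """Pick the hidden state at the first position where the token was emitted."""
    return hidden_states[output_tokens.index(token)]


def decode_mask(embedding, pixel_features, threshold=0.0):
    """Stand-in decoder: dot each pixel feature with the embedding, then threshold."""
    return [
        [int(sum(e * f for e, f in zip(embedding, feat)) > threshold) for feat in row]
        for row in pixel_features
    ]


vocab = extend_vocab(["it", "is", "the", "cup"])
tokens = ["it", "is", SEG_TOKEN]                 # assumed model output
hidden = [[0.1, 0.0], [0.0, 0.2], [1.0, -1.0]]   # one vector per output token
pixels = [                                       # 2x2 image, 2-d feature per pixel
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.4], [0.1, 0.3]],
]
mask = decode_mask(seg_embedding(tokens, hidden), pixels)
print(mask)
```

In a real system the hidden state would come from the multi-modal LLM and the decoder would be a learned mask decoder operating on vision-backbone features; the dot product above is only a placeholder for that step.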

updated: Thu Aug 03 2023 17:38:21 GMT+0000 (UTC)

published: Tue Aug 01 2023 17:50:17 GMT+0000 (UTC)

arXiv

