Contextual Object Detection with Multimodal Large Language Models

Yuhang Zang; Wei Li; Jun Han; Kaiyang Zhou; Chen Change Loy

Recent Multimodal Large Language Models (MLLMs) perform remarkably well on vision-language tasks such as image captioning and question answering, but lack an essential perception ability: object detection. In this work, we address this limitation by introducing a novel research problem, contextual object detection, which requires understanding visible objects within different human-AI interactive contexts. We investigate three representative scenarios: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so that it can locate, identify, and associate visual objects with language inputs for human-AI interaction. ContextDET comprises three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within the human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. GitHub: https://github.com/yuhangzang/ContextDET.
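
The three-submodel, generate-then-detect flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`generate_then_detect`, `Detection`, the `toy_*` functions, the `[MASK]` placeholder) is hypothetical, and the three components are trivial stand-ins for a real visual encoder, a pre-trained LLM, and a visual decoder. The sketch only shows the order of operations: encode the image, let the language model produce object words from the visual-language context, then predict one box per generated word.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]
# Hypothetical stand-in for visual features: (object name, box) pairs.
VisualTokens = List[Tuple[str, Box]]


@dataclass
class Detection:
    word: str  # object word generated from the context
    box: Box   # bounding box predicted for that word


def generate_then_detect(
    image: Dict[str, Box],
    prompt: str,
    visual_encoder: Callable[[Dict[str, Box]], VisualTokens],
    llm: Callable[[VisualTokens, str], List[str]],
    visual_decoder: Callable[[VisualTokens, str], Box],
) -> List[Detection]:
    """Run the three stages in order: encode, generate words, detect boxes."""
    visual_tokens = visual_encoder(image)                 # (i) visual encoder
    object_words = llm(visual_tokens, prompt)             # (ii) context decoding
    return [
        Detection(word, visual_decoder(visual_tokens, word))  # (iii) box per word
        for word in object_words
    ]


# Toy stand-ins (assumptions for illustration only, no learning involved).
def toy_visual_encoder(image: Dict[str, Box]) -> VisualTokens:
    # Pretend the "image" is already a mapping from object name to box.
    return sorted(image.items())


def toy_llm(visual_tokens: VisualTokens, prompt: str) -> List[str]:
    # Cloze-style prompt: fill each [MASK] with one visible object name.
    n_masks = prompt.count("[MASK]")
    return [name for name, _ in visual_tokens][:n_masks]


def toy_visual_decoder(visual_tokens: VisualTokens, word: str) -> Box:
    # Look up the box associated with the generated word.
    return dict(visual_tokens)[word]


image = {"dog": (0.1, 0.2, 0.5, 0.9), "frisbee": (0.6, 0.1, 0.8, 0.3)}
detections = generate_then_detect(
    image, "a [MASK] catches a [MASK]",
    toy_visual_encoder, toy_llm, toy_visual_decoder,
)
```

Because the object words come from the language model rather than from a fixed label set, the detector in this arrangement is not tied to a predefined category list, which is the property the abstract attributes to the generate-then-detect framework.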

updated: Mon May 29 2023 17:50:33 GMT+0000 (UTC)

published: Mon May 29 2023 17:50:33 GMT+0000 (UTC)

arXiv
