ConceptFusion: Open-set Multimodal 3D Mapping

Krishna Murthy Jatavallabhula; Alihusein Kuwajerwala; Qiao Gu; Mohd Omama; Tao Chen; Shuang Li; Ganesh Iyer; Soroush Saryazdi; Nikhil Keetha; Ayush Tewari; Joshua B. Tenenbaum; Celso Miguel de Melo; Madhava Krishna; Liam Paull; Florian Shkurti; Antonio Torralba

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps remain largely confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, text prompts. We address both issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, see the project page (https://concept-fusion.github.io) or the 5-minute explainer video (https://www.youtube.com/watch?v=rkXgws8fiDs).
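The abstract describes fusing pixel-aligned features into a 3D map and then querying that map across modalities. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes each pixel has already been back-projected to a 3D point and paired with an embedding vector, and it fuses features with a running weighted mean per voxel (a common multi-view fusion scheme, chosen here as an assumption). The `FeatureMap` class, its methods, and the tiny 2-D feature vectors are all hypothetical stand-ins for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length (or a copy of v if it has zero norm)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


class FeatureMap:
    """Toy voxel map (hypothetical): each voxel keeps a weighted mean feature."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.feat = {}    # voxel index -> fused feature vector
        self.weight = {}  # voxel index -> accumulated fusion weight

    def _key(self, point):
        # Quantize a 3D point (x, y, z) to an integer voxel index.
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def fuse(self, point, feature, w=1.0):
        """Fuse one back-projected pixel feature into the voxel containing point."""
        k = self._key(point)
        f = normalize(feature)
        if k not in self.feat:
            self.feat[k] = f
            self.weight[k] = w
            return
        old_w = self.weight[k]
        self.feat[k] = [
            (old_w * a + w * b) / (old_w + w) for a, b in zip(self.feat[k], f)
        ]
        self.weight[k] = old_w + w

    def query(self, query_feature):
        """Return cosine similarity between the query and every voxel feature."""
        q = normalize(query_feature)
        return {
            k: sum(a * b for a, b in zip(normalize(f), q))
            for k, f in self.feat.items()
        }


# Usage with made-up 2-D features standing in for real embeddings.
fmap = FeatureMap(voxel_size=0.1)
fmap.fuse((0.01, 0.02, 0.03), [1.0, 0.0])  # first view of a point
fmap.fuse((0.04, 0.05, 0.02), [0.0, 1.0])  # second view, same voxel
fmap.fuse((1.05, 1.05, 1.05), [0.0, 1.0])  # a different voxel

scores = fmap.query([0.0, 1.0])
best_voxel = max(scores, key=scores.get)
```

Because the query only needs a vector in the same embedding space as the fused features, the same `query` call would serve text, image, or audio queries, provided some encoder (not shown, and assumed here) maps each modality into that shared space.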

updated: Wed Feb 15 2023 01:49:09 GMT+0000 (UTC)

published: Tue Feb 14 2023 18:40:26 GMT+0000 (UTC)

arXiv
