3D Neural Embedding Likelihood for Robust Probabilistic Inverse Graphics

Guangyao Zhou; Nishad Gothoskar; Lirui Wang; Joshua B. Tenenbaum; Dan Gutfreund; Miguel Lázaro-Gredilla; Dileep George; Vikash K. Mansinghka


The ability to perceive and understand 3D scenes is crucial for many applications in computer vision and robotics. Inverse graphics is an appealing approach to 3D scene understanding that aims to infer the 3D scene structure from 2D images. In this paper, we introduce probabilistic modeling to the inverse graphics framework to quantify uncertainty and achieve robustness in 6D pose estimation tasks. Specifically, we propose 3D Neural Embedding Likelihood (3DNEL) as a unified probabilistic model over RGB-D images, and develop efficient inference procedures on 3D scene descriptions. 3DNEL effectively combines neural embeddings learned from RGB with depth information to improve robustness in sim-to-real 6D object pose estimation from RGB-D images. On the YCB-Video dataset, 3DNEL performs on par with the state of the art while being much more robust in challenging regimes. In contrast to discriminative approaches, 3DNEL's probabilistic generative formulation jointly models multi-object scenes, quantifies uncertainty in a principled way, and handles object pose tracking under heavy occlusion. Finally, 3DNEL provides a principled framework for incorporating prior knowledge about the scene and objects, which allows natural extension to additional tasks such as camera pose tracking from video.

updated: Sat Mar 25 2023 00:04:08 GMT+0000 (UTC)

published: Tue Feb 07 2023 20:48:35 GMT+0000 (UTC)

arXiv
