SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Junwen Huang; Alexey Artemov; Yujin Chen; Shuaifeng Zhi; Kai Xu; Matthias Nießner

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics with only 2D labeling, which can be either manual or machine-generated. Our key technical innovation is to leverage differentiable rendering of color and semantics to bridge 2D observations and unknown 3D space, using the observed RGB images and 2D semantics as supervision, respectively. We additionally develop a learning pipeline and corresponding method to enable learning from imperfect predicted 2D labels. Such labels can additionally be acquired by synthesizing an augmented set of virtual training views that complement the original real captures, enabling a more efficient self-supervision loop for semantics. In summary, we propose an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from limited RGB-D images, without relying on any 3D ground-truth information. Our method achieves state-of-the-art performance in semantic scene reconstruction on two large-scale benchmark datasets, MatterPort3D and ScanNet, surpassing even baselines that use costly 3D annotations. To our knowledge, our method is also the first 2D-driven method addressing completion and semantic segmentation of real-world 3D scans.
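
As a rough illustration of how rendering color and semantics can connect a 3D prediction to 2D supervision, the sketch below composites per-sample color and class probabilities along a single camera ray and compares the resulting pixel with a 2D target. It is not the authors' implementation: the density-based alpha compositing, the L1 color loss, the cross-entropy semantic loss, and every name (`composite_along_ray`, `losses_2d`, `density`, `sem_logits`, `step`) are assumptions chosen for illustration, and the renderer used in the paper may differ. NumPy is used only to show the arithmetic; written in an automatic-differentiation framework, the same operations would let gradients flow from the 2D losses back to the 3D quantities.

```python
import numpy as np


def composite_along_ray(density, color, sem_logits, step):
    """Alpha-composite per-sample color and class probabilities along one ray.

    density:    (S,)   non-negative opacity density at each sample (assumed input)
    color:      (S, 3) RGB values in [0, 1] at each sample
    sem_logits: (S, C) unnormalized class scores at each sample
    step:       distance between consecutive samples
    """
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-density * step)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = alpha * trans  # (S,)

    # Numerically stable softmax over classes for each sample.
    z = sem_logits - sem_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    pixel_rgb = weights @ color  # (3,) rendered color
    pixel_sem = weights @ probs  # (C,) rendered class distribution
    return pixel_rgb, pixel_sem, weights


def losses_2d(pixel_rgb, pixel_sem, target_rgb, target_label, eps=1e-8):
    """2D supervision: L1 on color, cross-entropy on the rendered semantics."""
    rgb_loss = np.abs(pixel_rgb - target_rgb).mean()
    sem_loss = -np.log(pixel_sem[target_label] + eps)
    return rgb_loss, sem_loss


# Toy ray: two empty samples followed by two opaque ones.
density = np.array([0.0, 0.0, 50.0, 50.0])
color = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
sem_logits = np.array([[0, 0, 0], [0, 0, 0], [0, 10, 0], [10, 0, 0]], dtype=float)

pixel_rgb, pixel_sem, weights = composite_along_ray(density, color, sem_logits, step=1.0)
rgb_loss, sem_loss = losses_2d(pixel_rgb, pixel_sem, np.array([1.0, 0.0, 0.0]), 1)
```

In this toy ray the first opaque sample receives nearly all of the compositing weight, so the rendered pixel takes that sample's color and class while the occluded sample behind it contributes almost nothing. In a setup of this kind, the 2D target label could come from a manual annotation or from a machine-generated prediction.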

updated: Thu Apr 20 2023 19:20:30 GMT+0000 (UTC)

published: Tue Feb 07 2023 17:47:52 GMT+0000 (UTC)

arXiv
