S4R: Self-Supervised Semantic Scene Reconstruction from RGB-D Scans

Junwen Huang; Alexey Artemov; Yujin Chen; Shuaifeng Zhi; Kai Xu; Matthias Nießner

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely semantic scene reconstruction, using a fully self-supervised approach. To this end, we design a trainable model that takes both incomplete 3D reconstructions and their corresponding source RGB-D images as input, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics. Our key technical innovation is to leverage differentiable rendering of color and semantics, using the observed RGB images and a generic semantic segmentation model as color and semantics supervision, respectively. We additionally develop a method to synthesize an augmented set of virtual training views that complement the original real captures, enabling more efficient self-supervision for semantics. We thus propose an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from a few RGB-D images, without 3D or 2D ground truth. To our knowledge, our method is the first fully self-supervised method addressing completion and semantic segmentation of real-world 3D scans. It performs comparably to 3D-supervised baselines, surpasses baselines with 2D supervision on real datasets, and generalizes well to unseen scenes.
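The rendering-based supervision described in the abstract can be illustrated with a minimal sketch. It assumes generic emission-absorption compositing along a single camera ray, which is not necessarily the renderer used in the paper. All names (`composite_ray`, `self_supervised_losses`, `pseudo_label`) are hypothetical, and plain Python is used only to show the forward computation; in practice this would be written in an autodiff framework so that gradients reach the predicted volume.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composite_ray(densities, deltas, colors, sem_logits):
    """Alpha-composite per-sample color and class probabilities along one ray.

    densities:  per-sample volume density (assumed non-negative)
    deltas:     distance between consecutive samples
    colors:     per-sample RGB triples in [0, 1]
    sem_logits: per-sample class logits
    Returns (rgb, class_probs, accumulated_opacity).
    """
    rgb = [0.0, 0.0, 0.0]
    sem = [0.0] * len(sem_logits[0])
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, delta, color, logits in zip(densities, deltas, colors, sem_logits):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        probs = softmax(logits)
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        sem = [acc + weight * p for acc, p in zip(sem, probs)]
        transmittance *= 1.0 - alpha
    return rgb, sem, 1.0 - transmittance


def self_supervised_losses(rgb, sem, observed_rgb, pseudo_label, eps=1e-8):
    """Photometric L1 loss against the observed pixel, and cross-entropy
    against a pseudo-label assumed to come from a 2D segmentation model."""
    color_loss = sum(abs(a - b) for a, b in zip(rgb, observed_rgb)) / 3.0
    sem_loss = -math.log(sem[pseudo_label] + eps)
    return color_loss, sem_loss
```

Both losses depend only on the observed RGB pixel and a 2D model's prediction, so no 3D or 2D ground-truth annotation enters the objective. The same compositing applies to virtual views, where only the semantic term would have a target.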

updated: Tue Feb 21 2023 20:50:33 GMT+0000 (UTC)

published: Tue Feb 07 2023 17:47:52 GMT+0000 (UTC)

arXiv
