A Real-Time Online Learning Framework for Joint 3D Reconstruction and Semantic Segmentation of Indoor Scenes

Davide Menini; Suryansh Kumar; Martin R. Oswald; Erik Sandstrom; Cristian Sminchisescu; Luc Van Gool

This paper presents a real-time online vision framework that jointly recovers the 3D structure and semantic labels of an indoor scene. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed neural network learns to fuse depth across frames with suitable semantic labels in scene space. To solve this task, our approach exploits a joint volumetric representation of depth and semantics in the scene feature space. For online fusion of semantic labels and geometry in real time, we introduce an efficient vortex pooling block while dropping the routing network from online depth fusion to preserve high-frequency surface details. We show that the context information provided by the scene semantics helps the depth fusion network learn noise-resistant features. It also helps overcome the shortcomings of the current online depth fusion method in dealing with thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our approach performs depth fusion at 37 and 10 frames per second with average reconstruction F-scores of 88% and 91%, respectively, depending on the depth map resolution. Moreover, our model achieves an average IoU score of 0.515 on the ScanNet 3D semantic benchmark leaderboard.
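The abstract reports a reconstruction F-score and an average IoU without defining either metric. As a rough illustration only, the sketch below assumes the commonly used definitions: an F-score that combines precision and recall of point-to-point distances under a threshold, and a mean IoU averaged over the classes that actually occur. The function names, the threshold parameter `tau`, and the brute-force nearest-neighbour search are assumptions made for this example and are not taken from the paper or from the benchmark's evaluation code.

```python
import numpy as np


def reconstruction_f_score(pred_points, gt_points, tau):
    """F-score between two 3D point sets at distance threshold tau (assumed definition)."""
    pred = np.asarray(pred_points, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    # Pairwise Euclidean distances, shape (num_pred, num_gt).
    # Brute force: fine for small sets, too slow and memory-hungry for full scans.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points that lie close to some ground-truth point.
    precision = float(np.mean(dists.min(axis=1) <= tau))
    # Recall: fraction of ground-truth points that lie close to some predicted point.
    recall = float(np.mean(dists.min(axis=0) <= tau))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union over classes present in prediction or ground truth."""
    pred = np.asarray(pred_labels)
    gt = np.asarray(gt_labels)
    ious = []
    for c in range(num_classes):
        union = np.sum((pred == c) | (gt == c))
        if union == 0:
            continue  # class absent from both label sets; skip it
        intersection = np.sum((pred == c) & (gt == c))
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0
```

For example, `reconstruction_f_score(pred, gt, tau=0.05)` would compare a reconstructed point set against a reference one at a hypothetical 5 cm threshold, if the coordinates are in metres. The threshold actually used in the paper is not stated in the abstract.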

updated: Wed Aug 11 2021 14:29:01 GMT+0000 (UTC)

published: Wed Aug 11 2021 14:29:01 GMT+0000 (UTC)

arXiv

