A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation

Jaehoon Cho; Dongbo Min; Youngjung Kim; Kwanghoon Sohn

Current self-supervised methods for monocular depth estimation are largely based on deeply nested convolutional networks that leverage stereo image pairs or monocular sequences during training. However, they often produce inaccurate results around occluded regions and depth boundaries. In this paper, we present a simple yet effective approach for monocular depth estimation using stereo image pairs. We propose a student-teacher strategy in which a shallow student network is trained with auxiliary information obtained from a deeper and more accurate teacher network. Specifically, we first train the stereo teacher network by fully utilizing the binocular perception of 3-D geometry, and then use the depth predictions of the teacher network to train the student network for monocular depth inference. This enables us to exploit all available depth data from massive unlabeled stereo pairs. We propose a data ensemble strategy that merges multiple depth predictions of the teacher network, improving the training samples by collecting non-trivial knowledge beyond a single prediction. To refine the inaccurate depth estimates used when training the student network, we further propose a stereo confidence-guided regression loss that handles unreliable pseudo depth values in occlusions, texture-less regions, and repetitive patterns. To complement the existing dataset comprising outdoor driving scenes, we built a novel large-scale dataset consisting of one million outdoor stereo images taken using hand-held stereo cameras. Finally, we demonstrate that the monocular depth estimation network provides feature representations that are suitable for high-level vision tasks. The experimental results for various outdoor scenarios demonstrate the effectiveness and flexibility of our approach, which outperforms state-of-the-art approaches.
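The two training ideas named in the abstract, merging several teacher predictions and down-weighting unreliable pseudo depth, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the per-pixel mean used as the ensemble rule, the hard confidence threshold of 0.5, the L1 error, and the function names `ensemble_pseudo_depth` and `confidence_guided_l1`.

```python
import numpy as np


def ensemble_pseudo_depth(predictions):
    """Merge several teacher depth maps (each H x W) with a per-pixel mean.

    The mean is an assumed merge rule, chosen only for illustration.
    """
    stacked = np.stack(predictions, axis=0)  # shape: (N, H, W)
    return stacked.mean(axis=0)


def confidence_guided_l1(student_depth, pseudo_depth, confidence, threshold=0.5):
    """Mean absolute error over pixels whose confidence exceeds the threshold.

    The threshold value and the L1 form are assumptions, not the paper's loss.
    """
    mask = (confidence > threshold).astype(np.float64)
    abs_error = np.abs(student_depth - pseudo_depth)
    # Guard against an all-zero mask so the division stays defined.
    return float((mask * abs_error).sum() / max(mask.sum(), 1.0))


# Toy usage with hypothetical 2 x 2 depth maps.
teacher_maps = [np.full((2, 2), 2.0), np.full((2, 2), 4.0)]
pseudo = ensemble_pseudo_depth(teacher_maps)      # every pixel becomes 3.0
confidence = np.array([[0.9, 0.1], [0.8, 0.2]])   # left column is reliable
student = np.array([[3.5, 10.0], [2.5, 0.0]])
loss = confidence_guided_l1(student, pseudo, confidence)
```

In this toy case the two low-confidence pixels, where the student disagrees most with the pseudo depth, contribute nothing to the loss, which is the intended effect of confidence guidance in occluded or texture-less regions.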

updated: Fri Oct 22 2021 03:23:24 GMT+0000 (UTC)

published: Tue Apr 23 2019 10:02:39 GMT+0000 (UTC)

arXiv
