3D Object Aided Self-Supervised Monocular Depth Estimation

Songlin Wei; Guodong Chen; Wenzheng Chi; Zhenhua Wang; Lining Sun

Monocular depth estimation has been actively studied in fields such as robot vision, autonomous driving, and 3D scene understanding. Given a sequence of color images, unsupervised learning methods based on the Structure-from-Motion (SfM) framework simultaneously predict depth and relative camera pose. However, dynamically moving objects in the scene violate the static-world assumption, resulting in inaccurate depth for dynamic objects. In this work, we propose a new method that addresses such dynamic object motion through monocular 3D object detection. Specifically, we first detect 3D objects in the images and build a per-pixel correspondence between the dynamic pixels and the detected object poses, while leaving the static pixels, which correspond to the rigid background, to be modeled by camera motion. In this way, the depth of every pixel can be learned via a meaningful geometric model. In addition, objects are detected as cuboids with absolute scale, which is used to eliminate the scale ambiguity inherent in monocular vision. Experiments on the KITTI depth dataset show that our method achieves state-of-the-art performance for depth estimation. Furthermore, joint training of depth, camera motion, and object pose also improves monocular 3D object detection performance. To the best of our knowledge, this is the first work that allows a monocular 3D object detection network to be fine-tuned in a self-supervised manner.
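The per-pixel correspondence described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `reproject`, the object masks, and the per-object 4x4 transforms are hypothetical inputs assumed here for illustration. Static pixels are warped with a camera-motion transform, and pixels inside a detected object are warped with that object's own rigid transform.

```python
import numpy as np

def reproject(depth, K, T_cam, obj_masks=(), T_objs=()):
    """Warp target-frame pixels into the source frame (illustrative sketch).

    depth:     (H, W) predicted depth of the target frame
    K:         (3, 3) camera intrinsics
    T_cam:     (4, 4) rigid transform applied to static (background) pixels
    obj_masks: boolean (H, W) masks, one per detected object (assumed given)
    T_objs:    (4, 4) rigid transforms, one per object (assumed given)
    Returns an (H, W, 2) array of (u, v) coordinates in the source frame.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project every pixel to a 3D point in the target camera frame.
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth[..., None]
    pts_h = np.concatenate([pts, np.ones((H, W, 1))], axis=-1)

    # One rigid transform per pixel: camera motion by default,
    # overridden by the object's transform inside each object mask.
    T = np.broadcast_to(np.asarray(T_cam, dtype=np.float64), (H, W, 4, 4)).copy()
    for mask, T_obj in zip(obj_masks, T_objs):
        T[mask] = T_obj

    # Apply the per-pixel transform and project back to the image plane.
    moved = np.einsum("hwij,hwj->hwi", T, pts_h)[..., :3]
    proj = moved @ K.T
    return proj[..., :2] / proj[..., 2:3]
```

In a self-supervised setup of this kind, the returned coordinates would typically be used to sample the source image and compare the result with the target image in a photometric loss. How the paper derives the object transforms from the detected poses is not specified in the abstract, so they are treated as given here.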

updated: Sun Dec 04 2022 08:52:33 GMT+0000 (UTC)

published: Sun Dec 04 2022 08:52:33 GMT+0000 (UTC)

arXiv
