Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes

Dan Xu; Andrea Vedaldi; Joao F. Henriques

We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point-of-view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict different poses for each region, in a sliding window from a learned dense pose map. This represents a significantly richer model, including 6D object motions, with little additional complexity. We achieve very competitive performance on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.
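
The view-synthesis step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`reproject`, `warp_nearest`, `patch_errors`), the nearest-neighbour sampling, and the use of non-overlapping patches in place of the sliding window over a learned dense pose map are all assumptions made for brevity. The sketch assumes a pinhole camera with intrinsics `K`, a depth map for the target frame, and a single rigid transform `T` from the target camera frame to the source camera frame.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to its coordinates in the source image.

    depth: (H, W) depth of the target frame
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames)
    T:     (4, 4) rigid transform, target camera frame -> source camera frame
    Returns an (H, W, 2) array of (x, y) source-pixel coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T        # back-project pixels to unit-depth rays
    pts = rays * depth[..., None]          # 3D points in the target frame
    pts = pts @ T[:3, :3].T + T[:3, 3]     # express the points in the source frame
    proj = pts @ K.T                       # project with the same intrinsics
    return proj[..., :2] / proj[..., 2:3]  # perspective divide

def warp_nearest(source, coords):
    """Re-render the target view by nearest-neighbour lookup in the source image."""
    H, W = source.shape
    x = np.clip(np.rint(coords[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.rint(coords[..., 1]).astype(int), 0, H - 1)
    return source[y, x]

def patch_errors(target, warped, size):
    """Mean absolute photometric error on non-overlapping size x size patches."""
    H, W = target.shape
    err = np.abs(target - warped)[: H - H % size, : W - W % size]
    return err.reshape(H // size, size, W // size, size).mean(axis=(1, 3))

# Toy usage: constant depth and a sideways camera translation.
K = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
T = np.eye(4)
T[0, 3] = 1.0                              # translate 1 unit along x
source = np.arange(24, dtype=float).reshape(4, 6)
coords = reproject(depth, K, T)
warped = warp_nearest(source, coords)
errors = patch_errors(warped, warped, size=2)
```

Evaluating the error per patch rather than over the whole image is what would allow a different pose to be fitted to each region; in a trainable version the nearest-neighbour lookup would be replaced by a differentiable sampler such as bilinear interpolation.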

updated: Tue Jun 01 2021 06:49:35 GMT+0000 (UTC)

published: Wed May 05 2021 17:08:10 GMT+0000 (UTC)

arXiv
