Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes

Dan Xu; Andrea Vedaldi; Joao F. Henriques

We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point of view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict different poses for each region, in a sliding window. This represents a significantly richer model, including 6D object motions, with little additional complexity. We establish new state-of-the-art results on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.
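
As a rough illustration of the view synthesis idea in the abstract, the sketch below reprojects target-view pixels into a source view using a depth map and a relative pose, and then averages a per-pixel error over small windows. It is not the authors' implementation: the function names `reproject` and `patch_errors`, the pinhole intrinsics matrix `K`, the 4x4 transform `T`, and the use of non-overlapping windows (instead of the sliding window the abstract mentions) are all assumptions made for this example.

```python
import numpy as np

def reproject(depth, K, T):
    """Map each target pixel to source-image coordinates (hypothetical helper).

    depth: (H, W) depth of the target view.
    K:     (3, 3) pinhole intrinsics (assumed shared by both views).
    T:     (4, 4) rigid transform from the target to the source camera frame.
    Returns an (H, W, 2) array of (u, v) coordinates in the source image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T                       # back-project pixels to rays
    pts = rays * depth[..., None]                         # 3D points in the target frame
    pts = pts @ T[:3, :3].T + T[:3, 3]                    # move points to the source frame
    proj = pts @ K.T                                      # project with the intrinsics
    return proj[..., :2] / proj[..., 2:3]                 # perspective divide

def patch_errors(err, win):
    """Mean error over non-overlapping win x win patches (a simplification)."""
    H, W = err.shape
    h, w = H // win, W // win
    return err[:h * win, :w * win].reshape(h, win, w, win).mean(axis=(1, 3))

# Example: constant depth and a small sideways camera translation.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
T = np.eye(4)
T[0, 3] = 0.1                                             # 0.1 units along x
coords = reproject(depth, K, T)
```

With constant depth 2 and focal length 100, the 0.1 translation shifts every pixel by 100 * 0.1 / 2 = 5 columns. In a per-region model of the kind the abstract describes, a different `T` would be predicted for each window, and an error map such as the output of `patch_errors` would be minimized region by region.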

updated: Wed May 05 2021 17:08:10 GMT+0000 (UTC)

published: Wed May 05 2021 17:08:10 GMT+0000 (UTC)

arXiv
