Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video

Jianfeng Li; Junqiao Zhao; Shuangfu Song; Tiantian Feng

Estimating geometric elements such as depth, camera motion, and optical flow from images is an important part of robotic visual perception. We use a joint self-supervised method to estimate these three geometric elements. The depth network, optical flow network, and camera motion network are independent of each other but are jointly optimized during the training phase. Compared with independent training, joint training can make full use of the geometric relationships between the elements and provides both dynamic and static information about the scene. In this paper, we improve the joint self-supervised method in three respects: network structure, dynamic object segmentation, and geometric constraints. For network structure, we apply an attention mechanism to the camera motion network, which helps exploit the similarity of camera movement between frames. Following the attention mechanism in the Transformer, we also propose a plug-and-play convolutional attention module. For dynamic objects, because they affect the optical flow self-supervised framework and the depth-pose self-supervised framework differently, we propose a threshold algorithm to detect dynamic regions and mask them in the respective loss functions. For geometric constraints, we use traditional methods to estimate the fundamental matrix from corresponding points and use it to constrain the camera motion network. We demonstrate the effectiveness of our method on the KITTI dataset. Compared with other joint self-supervised methods, our method achieves state-of-the-art performance in pose and optical flow estimation, and competitive results in depth estimation. Code will be available at https://github.com/jianfenglihg/Unsupervised_geometry.
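The geometric constraint mentioned in the abstract rests on the standard epipolar relation between corresponding points in two views. The sketch below is only an illustration of that relation, not the paper's loss: it builds the essential matrix E = [t]_x R from a relative pose (with X2 = R X1 + t) and measures how far a pixel correspondence is from satisfying x2^T E x1 = 0. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the example numbers are all assumptions made for this example.

```python
def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def normalize_pixel(p, fx, fy, cx, cy):
    """Apply K^-1 for a pinhole camera: pixel (u, v) -> normalized ray."""
    u, v = p
    return [(u - cx) / fx, (v - cy) / fy, 1.0]


def project(X, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates to a pixel."""
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)


def epipolar_residual(p1, p2, R, t, fx, fy, cx, cy):
    """Algebraic epipolar error |x2^T E x1| with E = [t]_x R."""
    x1 = normalize_pixel(p1, fx, fy, cx, cy)
    x2 = normalize_pixel(p2, fx, fy, cx, cy)
    E = matmul3(skew(t), R)
    Ex1 = matvec3(E, x1)
    return abs(sum(a * b for a, b in zip(x2, Ex1)))


# Hypothetical intrinsics and pose: no rotation, scene moves 1 unit closer.
fx, fy, cx, cy = 700.0, 700.0, 600.0, 180.0
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, -1.0]

X1 = [2.0, 1.0, 10.0]        # 3D point seen in frame 1
X2_static = [2.0, 1.0, 9.0]  # same point in frame 2 if it is static
X2_moving = [2.5, 1.0, 9.0]  # same point if it also moved sideways

p1 = project(X1, fx, fy, cx, cy)
static_err = epipolar_residual(p1, project(X2_static, fx, fy, cx, cy),
                               R, t, fx, fy, cx, cy)
moving_err = epipolar_residual(p1, project(X2_moving, fx, fy, cx, cy),
                               R, t, fx, fy, cx, cy)
print("static:", static_err, "moving:", moving_err)
```

In this toy setup a correspondence on a static point satisfies the relation up to floating-point error, while a point that moved independently of the camera does not. This is one reason why static correspondences can constrain camera motion and why dynamic regions are typically masked out first.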

updated: Sun May 30 2021 12:39:48 GMT+0000 (UTC)

published: Sun May 30 2021 12:39:48 GMT+0000 (UTC)

arXiv
