Learning Interpretable BEV Based VIO without Deep Neural Networks

Zexi Chen; Haozhe Du; Xuecheng Xu; Rong Xiong; Yiyi Liao; Yue Wang

Monocular visual-inertial odometry (VIO) is a critical problem in robotics and autonomous driving. Traditional methods solve it by filtering or optimization. While fully interpretable, they rely on manual intervention and empirical parameter tuning. Learning-based approaches, on the other hand, allow end-to-end training but require a large amount of training data to learn millions of parameters, and their heavy, non-interpretable models hinder generalization. In this paper, we propose a fully differentiable and interpretable bird's-eye-view (BEV) based VIO model for robots with local planar motion that can be trained without deep neural networks. Specifically, we first adopt an Unscented Kalman Filter (UKF) as a differentiable layer to predict the pitch and roll, where the noise covariance matrices are learned in order to filter out the noise in the raw IMU data. Second, the refined pitch and roll are used to obtain a gravity-aligned BEV image of each frame via differentiable camera projection. Finally, a differentiable pose estimator estimates the remaining 3 DoF pose between BEV frames, yielding a 5 DoF pose estimate. Our method allows the covariance matrices to be learned end-to-end under supervision from the pose estimation loss, outperforming empirical baselines. Experimental results on synthetic and real-world datasets show that our simple approach is competitive with state-of-the-art methods and generalizes well to unseen scenes.
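To illustrate why pitch and roll can be recovered from inertial data while the remaining planar motion (presumably x, y and yaw) must come from the BEV images, the following minimal sketch relates the gravity direction seen by an accelerometer to roll and pitch. It is not the authors' learned UKF. It assumes a stationary, noise-free sensor, a body frame with x forward, y left and z up, and ZYX Euler angles. All function names are hypothetical.

```python
import math

G = 9.81  # assumed gravity magnitude in m/s^2


def accel_from_roll_pitch(roll, pitch, g=G):
    """Specific force read by a stationary, noise-free accelerometer (body frame).

    Assumes R = Rz(yaw) @ Ry(pitch) @ Rx(roll); yaw drops out because
    gravity is invariant to rotation about the vertical axis.
    """
    return (
        -g * math.sin(pitch),
        g * math.sin(roll) * math.cos(pitch),
        g * math.cos(roll) * math.cos(pitch),
    )


def roll_pitch_from_accel(f):
    """Invert the model above: recover roll and pitch from the gravity direction."""
    fx, fy, fz = f
    roll = math.atan2(fy, fz)
    pitch = math.atan2(-fx, math.hypot(fy, fz))
    return roll, pitch


def level(v, roll, pitch):
    """Rotate a body-frame vector into the gravity-aligned frame: Ry(pitch) @ Rx(roll) @ v."""
    x, y, z = v
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    # Apply Rx(roll) first.
    y1 = cr * y - sr * z
    z1 = sr * y + cr * z
    # Then apply Ry(pitch).
    x2 = cp * x + sp * z1
    z2 = -sp * x + cp * z1
    return (x2, y1, z2)


if __name__ == "__main__":
    f = accel_from_roll_pitch(0.1, -0.2)
    roll, pitch = roll_pitch_from_accel(f)
    print(roll, pitch)
    print(level(f, roll, pitch))  # gravity should point along +z after leveling
```

In the setting described in the abstract, raw IMU readings are noisy and the robot is moving, which is why a filter with learned noise covariances is used instead of this direct inversion. The sketch only shows the geometry that makes the gravity-aligned BEV projection possible.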

updated: Sat Sep 17 2022 11:56:31 GMT+0000 (UTC)

published: Sat Sep 25 2021 06:54:09 GMT+0000 (UTC)

arXiv
