AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation

Nimet Kaygusuz; Oscar Mendez; Richard Bowden

Motion estimation approaches typically employ sensor fusion techniques, such as the Kalman Filter, to handle individual sensor failures. More recently, deep learning-based fusion approaches have been proposed, improving performance and requiring fewer model-specific implementations. However, current deep fusion approaches often assume that sensors are synchronised, which is not always practical, especially for low-cost hardware. To address this limitation, we propose AFT-VO, a novel transformer-based sensor fusion architecture that estimates visual odometry (VO) from multiple sensors. Our framework combines predictions from asynchronous multi-view cameras and accounts for the time discrepancies between measurements coming from different sources. Our approach first employs a Mixture Density Network (MDN) to estimate the probability distributions of the 6-DoF poses for every camera in the system. A novel transformer-based fusion module, AFT-VO, then combines these asynchronous pose estimates with their confidences. More specifically, we introduce Discretiser and Source Encoding techniques, which enable the fusion of multi-source asynchronous signals. We evaluate our approach on the popular nuScenes and KITTI datasets. Our experiments demonstrate that multi-view fusion for VO estimation provides robust and accurate trajectories, outperforming the state of the art in both challenging weather and challenging lighting conditions.
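
The abstract does not spell out how the Discretiser and Source Encoding work, so the following is only a minimal sketch of the general idea, under stated assumptions: per-camera pose estimates arrive at unaligned times, each timestamp is mapped to a fixed-width time bin (a stand-in for discretisation), and each estimate keeps its camera index (a stand-in for source encoding) so that a downstream fusion model could tell sources apart. The names `PoseEstimate`, `FusionToken`, `build_tokens`, and the `bin_width_ms` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PoseEstimate:
    """One per-camera prediction (hypothetical container, not from the paper)."""
    camera_id: int             # index of the camera that produced the estimate
    timestamp_ms: int          # capture time in integer milliseconds
    pose: Tuple[float, ...]    # 6-DoF pose, e.g. (tx, ty, tz, rx, ry, rz)
    confidence: float          # scalar confidence attached to the estimate


@dataclass(frozen=True)
class FusionToken:
    """An estimate annotated with a discrete time index and its source index."""
    time_index: int
    camera_id: int
    pose: Tuple[float, ...]
    confidence: float


def build_tokens(estimates: Sequence[PoseEstimate], bin_width_ms: int) -> List[FusionToken]:
    """Merge asynchronous estimates into one time-ordered token sequence.

    Assumption: timestamps are bucketed into fixed-width bins measured from
    the earliest estimate. The actual scheme used by AFT-VO may differ.
    """
    if bin_width_ms <= 0:
        raise ValueError("bin_width_ms must be positive")
    if not estimates:
        return []

    # Order by time, breaking ties by camera so the output is deterministic.
    ordered = sorted(estimates, key=lambda e: (e.timestamp_ms, e.camera_id))
    t0 = ordered[0].timestamp_ms

    return [
        FusionToken(
            time_index=(e.timestamp_ms - t0) // bin_width_ms,  # discretised time
            camera_id=e.camera_id,                             # source identity
            pose=e.pose,
            confidence=e.confidence,
        )
        for e in ordered
    ]


if __name__ == "__main__":
    zero_pose = (0.0,) * 6
    readings = [
        PoseEstimate(camera_id=1, timestamp_ms=130, pose=zero_pose, confidence=0.7),
        PoseEstimate(camera_id=0, timestamp_ms=0, pose=zero_pose, confidence=0.9),
        PoseEstimate(camera_id=2, timestamp_ms=260, pose=zero_pose, confidence=0.5),
        PoseEstimate(camera_id=0, timestamp_ms=100, pose=zero_pose, confidence=0.8),
    ]
    for token in build_tokens(readings, bin_width_ms=100):
        print(token.time_index, token.camera_id, token.confidence)
```

In a learned model, the integer `time_index` and `camera_id` would typically be turned into embeddings and added to the pose features before the transformer layers. That step is omitted here because the abstract gives no detail on it.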

updated: Fri Sep 16 2022 13:47:18 GMT+0000 (UTC)

published: Sun Jun 26 2022 19:29:08 GMT+0000 (UTC)

arXiv
