Multi-modal Multi-level Fusion for 3D Single Object Tracking

Zhiheng Li; Yubo Cui; Zuoxu Gu; Zheng Fang



3D single object tracking plays a crucial role in computer vision. Mainstream methods rely mainly on point clouds to perform geometry matching between the target template and the search area. However, textureless and incomplete point clouds make it difficult for single-modal trackers to distinguish objects with similar structures. To overcome the limitations of geometry matching, we propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits both image texture and the geometric characteristics of point clouds to track 3D targets. Specifically, we first propose a Space Alignment Module (SAM) to align RGB images with point clouds in 3D space, which is the prerequisite for constructing inter-modal associations. Then, at the feature interaction level, we design a Feature Interaction Module (FIM) based on a dual-stream structure, which enhances intra-modal features in parallel and constructs inter-modal semantic associations. Meanwhile, to refine the features of each modality, we introduce a Coarse-to-Fine Interaction Module (CFIM) that realizes hierarchical feature interaction at different scales. Finally, at the similarity fusion level, we propose a Similarity Fusion Module (SFM) to aggregate geometry and texture cues from the target. Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains over the previous multi-modal method) and is also competitive on NuScenes.
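
The abstract does not say how SAM performs the image-to-point-cloud alignment. As a generic illustration of the kind of cross-modal association it refers to, and not the authors' method, the sketch below projects 3D points onto an RGB image with a pinhole camera model and reads the color at each projected pixel. The function name `sample_colors_for_points`, the assumption that the points are already expressed in the camera frame, and the intrinsic matrix `K` (with last row `[0, 0, 1]`) are all hypothetical choices made for this example.

```python
import numpy as np


def sample_colors_for_points(points_cam, image, K):
    """Attach an RGB color to each 3D point that projects inside the image.

    Assumptions (not from the paper):
      points_cam: (N, 3) points already expressed in the camera frame.
      image:      (H, W, 3) RGB array.
      K:          (3, 3) pinhole intrinsic matrix with last row [0, 0, 1].

    Returns (colors, valid): colors is (N, 3) float64 with zeros for points
    that have no pixel; valid is an (N,) boolean mask.
    """
    points_cam = np.asarray(points_cam, dtype=np.float64)
    height, width = image.shape[:2]

    # Points behind (or on) the camera plane cannot be projected.
    z = points_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid division by zero

    # Pinhole projection: [u, v, 1]^T ~ K [x, y, z]^T.
    uvw = points_cam @ np.asarray(K, dtype=np.float64).T
    u = uvw[:, 0] / safe_z
    v = uvw[:, 1] / safe_z

    # Nearest-lower pixel indices (column from u, row from v).
    col = np.floor(u).astype(int)
    row = np.floor(v).astype(int)

    valid = in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)

    colors = np.zeros((points_cam.shape[0], 3), dtype=np.float64)
    colors[valid] = image[row[valid], col[valid]]
    return colors, valid
```

With real sensor data, LiDAR points would first need to be transformed into the camera frame using the dataset's extrinsic calibration. That step is omitted here, and a learned module such as SAM may operate quite differently from this per-point lookup.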

updated: Thu May 11 2023 13:34:02 GMT+0000 (UTC)

published: Thu May 11 2023 13:34:02 GMT+0000 (UTC)

arXiv

