Dense Voxel Fusion for 3D Object Detection

Anas Mahmoud; Jordan S. K. Hu; Steven L. Waslander

Camera and LiDAR sensor modalities provide complementary appearance and geometric information useful for detecting 3D objects in autonomous vehicle applications. However, current fusion models underperform state-of-the-art LiDAR-only methods on 3D object detection benchmarks. Our proposed solution, Dense Voxel Fusion (DVF), is a sequential fusion method that generates multi-scale, multi-modal dense voxel feature representations, improving expressiveness in regions of low point density. To enhance multi-modal learning, we train directly with ground truth 2D bounding box labels, avoiding noisy, detector-specific 2D predictions. Additionally, we use LiDAR ground truth sampling to simulate missed 2D detections and to accelerate training convergence. Both DVF and the multi-modal training approach can be applied to any voxel-based LiDAR backbone without introducing additional learnable parameters. DVF outperforms existing sparse fusion detectors, ranking first among all published fusion methods on KITTI's 3D car detection benchmark at the time of submission, and significantly improves the 3D vehicle detection performance of voxel-based methods on the Waymo Open Dataset. We also show that our proposed multi-modal training strategy generalizes better than training with erroneous 2D predictions.
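
The abstract does not spell out how 2D bounding boxes and voxel features are combined, so the following is only an illustrative sketch of one parameter-free way such a fusion could work, not the authors' implementation. It assumes voxel centers are already expressed in the camera frame, a simple pinhole camera model, and a hard inside-the-box foreground test. The names `project_to_image`, `foreground_weights`, `weight_features`, and the `background` parameter are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels; hypothetical box format for this sketch.
Box2D = Tuple[float, float, float, float]
Point3D = Tuple[float, float, float]


def project_to_image(point: Point3D, fx: float, fy: float,
                     cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point (x right, y down, z forward).

    Returns None for points at or behind the image plane.
    """
    x, y, z = point
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def foreground_weights(centers: Sequence[Point3D], boxes: Sequence[Box2D],
                       fx: float, fy: float, cx: float, cy: float,
                       background: float = 0.0) -> List[float]:
    """Assign weight 1.0 to voxel centers that project inside any 2D box.

    All other voxels receive `background` (an assumed hyperparameter).
    """
    weights = []
    for center in centers:
        uv = project_to_image(center, fx, fy, cx, cy)
        inside = uv is not None and any(
            x0 <= uv[0] <= x1 and y0 <= uv[1] <= y1
            for (x0, y0, x1, y1) in boxes
        )
        weights.append(1.0 if inside else background)
    return weights


def weight_features(features: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[List[float]]:
    """Scale each voxel's feature vector by its weight (no learned parameters)."""
    return [[w * f for f in feat] for feat, w in zip(features, weights)]
```

During training, `boxes` would hold ground truth 2D labels, as the abstract describes. At inference time they would have to come from a 2D detector. Because the weighting introduces no learnable parameters, a sketch like this could in principle sit on top of any voxel-based backbone.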

updated: Wed Mar 02 2022 04:51:31 GMT+0000 (UTC)

published: Wed Mar 02 2022 04:51:31 GMT+0000 (UTC)

arXiv
