LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion

Xin Li; Tao Ma; Yuenan Hou; Botian Shi; Yucheng Yang; Youquan Liu; Xingjiao Wu; Qin Chen; Yikang Li; Yu Qiao; Liang He

LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. This practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet builds on previous literature, but exclusively uses point centroids to represent the position of voxel features more precisely, thus achieving better cross-modal alignment. For the Local Fusion (LoF), we first divide each proposal into uniform grids and then project the grid centers onto the images. The image features around the projected grid points are sampled and fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both the Waymo Open Dataset (WOD) and KITTI show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on the Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on all three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet.
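The geometric steps named in the abstract (point centroids per voxel for GoF, and uniform proposal grids projected to the image for LoF) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2 x 2 x 2 grid, the pinhole intrinsics, and the fixed LiDAR-to-camera axis swap (LiDAR x forward, y left, z up; camera z forward, x right, y down, with no translation or calibration matrix) are all assumptions made for illustration. Feature sampling and fusion are omitted.

```python
import math
from collections import defaultdict


def voxel_centroids(points, voxel_size):
    """Group points by voxel index and return the mean point of each voxel.

    The centroid, rather than the geometric voxel center, is what the
    abstract says is used to locate voxel features.
    """
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }


def proposal_grid_centers(center, size, yaw, n=2):
    """Centers of an n x n x n uniform grid inside a yaw-rotated 3D box."""
    cx, cy, cz = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Cell-center offsets in the box frame, as fractions in (-0.5, 0.5).
    fracs = [(i + 0.5) / n - 0.5 for i in range(n)]
    centers = []
    for fx in fracs:
        for fy in fracs:
            for fz in fracs:
                lx, ly, lz = fx * length, fy * width, fz * height
                # Rotate about the vertical (z) axis, then translate.
                centers.append((
                    cx + lx * cos_y - ly * sin_y,
                    cy + lx * sin_y + ly * cos_y,
                    cz + lz,
                ))
    return centers


def lidar_to_camera(point):
    """Assumed axis swap only; a real setup needs the calibrated extrinsics."""
    x, y, z = point
    return (-y, -z, x)


def project_to_image(points_cam, fx, fy, u0, v0):
    """Pinhole projection; points behind the camera map to None."""
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + u0, fy * y / z + v0))
    return pixels


# Hypothetical proposal 10 m ahead of the sensor, 4 x 2 x 2 m, no rotation.
grid = proposal_grid_centers(center=(10.0, 0.0, 0.0), size=(4.0, 2.0, 2.0), yaw=0.0)
pixels = project_to_image([lidar_to_camera(p) for p in grid],
                          fx=900.0, fy=900.0, u0=640.0, v0=360.0)
```

In a full pipeline, the resulting pixel coordinates would index into an image feature map, and the features sampled around each projected point would be fused with the point cloud features of the corresponding grid cell.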

updated: Tue Mar 07 2023 02:00:34 GMT+0000 (UTC)

published: Tue Mar 07 2023 02:00:34 GMT+0000 (UTC)

arXiv

