PillarNet: Real-Time and High-Performance Pillar-based 3D Object Detection

Guangsheng Shi; Ruifeng Li; Chao Ma

Real-time, high-performance 3D object detection is critically important for autonomous driving. Recent top-performing 3D object detectors rely mainly on point-based or 3D voxel-based convolutions, both of which are computationally inefficient for onboard deployment. Although recent research has focused on these convolutions for higher accuracy, the resulting methods fail to meet latency and power-efficiency requirements, especially on embedded devices. In contrast, pillar-based methods use only 2D convolutions, which consume fewer computational resources, but they lag far behind their voxel-based counterparts in detection accuracy. The superiority of 3D voxel-based methods over pillar-based methods is still broadly attributed to the effectiveness of 3D convolutional neural networks (CNNs). In this paper, by examining the primary performance gap between pillar- and voxel-based detectors, we develop a real-time, high-performance pillar-based detector, dubbed PillarNet. The proposed PillarNet consists of a powerful encoder network for effective pillar feature learning, a neck network for spatial-semantic feature fusion, and a commonly used detection head. Using only 2D convolutions, PillarNet accommodates a configurable pillar size and is compatible with classical 2D CNN backbones such as VGGNet and ResNet. Additionally, PillarNet benefits from our orientation-decoupled IoU regression loss together with an IoU-aware prediction branch. Extensive experimental results on the large-scale nuScenes and Waymo Open datasets demonstrate that the proposed PillarNet outperforms state-of-the-art 3D detectors in both effectiveness and efficiency. Code will be made publicly available.
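
To illustrate what "pillar-based" means in the abstract above, the following is a minimal sketch of pillarization: LiDAR points are binned into vertical columns on a bird's-eye-view (BEV) grid, so that later processing can use 2D convolutions only. This is not the PillarNet implementation. The function names `pillarize` and `pillar_max_height`, the parameters `pillar_size`, `x_range`, and `y_range`, and the hand-crafted max-height feature (a stand-in for a learned pillar encoder) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres
Cell = Tuple[int, int]              # (ix, iy) index on the BEV grid


def pillarize(
    points: List[Point],
    pillar_size: float = 0.5,
    x_range: Tuple[float, float] = (0.0, 1.0),
    y_range: Tuple[float, float] = (0.0, 1.0),
) -> Dict[Cell, List[Point]]:
    """Group points into pillars: one bucket per occupied BEV grid cell.

    Hypothetical helper; points outside the x/y range are dropped.
    The z coordinate is ignored for binning, so each cell spans the
    full height of the scene (a vertical column).
    """
    pillars: Dict[Cell, List[Point]] = defaultdict(list)
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        pillars[(ix, iy)].append((x, y, z))
    return dict(pillars)


def pillar_max_height(pillars: Dict[Cell, List[Point]]) -> Dict[Cell, float]:
    """Toy per-pillar feature (an assumption): the highest z in each pillar."""
    return {cell: max(p[2] for p in pts) for cell, pts in pillars.items()}


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 1.0), (0.3, 0.2, 2.0), (0.7, 0.1, 0.5), (5.0, 5.0, 9.0)]
    grid = pillarize(cloud, pillar_size=0.5)
    print(pillar_max_height(grid))
```

Changing `pillar_size` changes only the grid resolution; the per-cell output stays a 2D map, which is why a pillar representation can be fed to ordinary 2D CNN backbones.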

updated: Thu May 19 2022 07:37:11 GMT+0000 (UTC)

published: Mon May 16 2022 00:14:50 GMT+0000 (UTC)

arXiv
