Voxel Transformer for 3D Object Detection

Jiageng Mao; Yujing Xue; Minzhe Niu; Haoyue Bai; Jiashi Feng; Xiaodan Liang; Hang Xu; Chunjing Xu

We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based 3D detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to their limited receptive fields. In this paper, we resolve the problem by introducing a Transformer-based architecture that enables long-range relationships between voxels through self-attention. Because non-empty voxels are naturally sparse but numerous, directly applying a standard Transformer to voxels is non-trivial. To this end, we propose the sparse voxel module and the submanifold voxel module, which can operate on the empty and non-empty voxel positions effectively. To further enlarge the attention range while keeping the computational overhead comparable to that of the convolutional counterparts, we propose two attention mechanisms for multi-head attention in those two modules: Local Attention and Dilated Attention. We further propose Fast Voxel Query to accelerate the querying process in multi-head attention. VoTr contains a series of sparse and submanifold voxel modules and can be applied in most voxel-based detectors. Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
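As a rough illustration of the ideas named in the abstract, the following minimal sketch shows how attention over sparse voxels might be organized. It is not the authors' implementation. It assumes that non-empty voxels are indexed by a coordinate-to-row hash table (a plain Python `dict` standing in for a fast voxel lookup), that "local" attention gathers every non-empty voxel in a small cube around the query, and that "dilated" attention samples a larger cube with a stride. All function names, the radii, and the stride are hypothetical, and learned query/key/value projections, positional encodings, and multiple heads are omitted.

```python
import math
from itertools import product


def build_voxel_table(coords):
    """Map each non-empty voxel coordinate (x, y, z) to its row index."""
    return {tuple(c): i for i, c in enumerate(coords)}


def local_offsets(radius):
    """All integer offsets inside a dense cube of the given radius."""
    r = range(-radius, radius + 1)
    return list(product(r, r, r))


def dilated_offsets(radius, stride):
    """Offsets inside a larger cube, sampled every `stride` voxels."""
    r = range(-radius, radius + 1, stride)
    return list(product(r, r, r))


def query_neighbors(table, center, offsets):
    """Return row indices of the non-empty voxels hit by the offsets."""
    found, seen = [], set()
    for dx, dy, dz in offsets:
        key = (center[0] + dx, center[1] + dy, center[2] + dz)
        idx = table.get(key)  # average O(1) lookup; empty voxels are skipped
        if idx is not None and idx not in seen:
            seen.add(idx)
            found.append(idx)
    return found


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def voxel_attention(coords, feats, center, query, offsets):
    """Attend from `center` to the non-empty voxels selected by `offsets`.

    Returns None when no non-empty voxel falls inside the attention range.
    Features are used directly as keys and values (no learned projections).
    """
    table = build_voxel_table(coords)
    idxs = query_neighbors(table, center, offsets)
    if not idxs:
        return None
    gathered = [feats[i] for i in idxs]
    return attend(query, gathered, gathered)
```

In this sketch the cost per query depends on the number of offsets probed rather than on the total number of voxels, which is the general reason a strided (dilated) pattern can widen the attention range without a proportional increase in work. Passing the query vector separately from `center` lets the same function be called at an empty position, although how a query feature is obtained there is left unspecified here.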

updated: Mon Sep 13 2021 13:28:39 GMT+0000 (UTC)

published: Mon Sep 06 2021 14:10:22 GMT+0000 (UTC)

arXiv
