DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

Haiyang Wang; Chen Shi; Shaoshuai Shi; Meng Lei; Sen Wang; Di He; Bernt Schiele; Liwei Wang

Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D object detection. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, because point clouds are sparse, applying a standard Transformer to sparse points is non-trivial. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D object detection. To process sparse points efficiently in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly and does not use any customized CUDA operations. Our model achieves state-of-the-art performance on the large-scale Waymo Open Dataset with remarkable gains. More importantly, DSVT can be easily deployed with TensorRT at real-time inference speed (27 Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
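The set partitioning idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `partition_window` and `rotated_partitions`, the 2D integer coordinates, the fixed `set_size`, and the choice of "sort x-major, then y-major" as the two alternating configurations are all assumptions made for illustration.

```python
from math import ceil


def partition_window(voxels, set_size, primary_axis):
    """Split the non-empty voxels of one window into sets of at most set_size.

    voxels: list of (x, y) integer coordinates of non-empty voxels (assumed 2D).
    primary_axis: 0 sorts x-major, 1 sorts y-major (hypothetical configurations).
    Returns a list of sets, each a list of indices into `voxels`.
    """
    secondary_axis = 1 - primary_axis
    # Order voxel indices along the chosen axis, breaking ties with the other axis.
    order = sorted(
        range(len(voxels)),
        key=lambda i: (voxels[i][primary_axis], voxels[i][secondary_axis]),
    )
    # The number of sets depends on how many voxels the window actually holds,
    # so sparse windows produce fewer sets than dense ones.
    num_sets = ceil(len(order) / set_size)
    return [order[k * set_size:(k + 1) * set_size] for k in range(num_sets)]


def rotated_partitions(voxels, set_size, num_layers):
    """Alternate the sorting axis from one attention layer to the next."""
    return [
        partition_window(voxels, set_size, layer % 2)
        for layer in range(num_layers)
    ]


if __name__ == "__main__":
    window = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for layer, sets in enumerate(rotated_partitions(window, 2, 2)):
        print(f"layer {layer}: {sets}")
```

In this sketch, voxels that share a set in one layer are split across different sets in the next, which is one way features could propagate between sets when attention is computed only within each set.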

updated: Sun Jan 15 2023 09:31:58 GMT+0000 (UTC)

published: Sun Jan 15 2023 09:31:58 GMT+0000 (UTC)

arXiv
