Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning

Cheng Zhang; Haocheng Wan; Shengqiang Liu; Xinyi Shen; Zizhao Wu

Due to the sparsity and irregularity of 3D data, approaches that process points directly have become popular. Among point-based models, Transformer-based models have achieved state-of-the-art performance by fully preserving the interrelations between points. However, most of them spend a high percentage of their total time on sparse data access (e.g., Farthest Point Sampling (FPS) and neighbor point queries), which becomes a computational burden. Therefore, we present a novel 3D Transformer, called Point-Voxel Transformer (PVT), that leverages self-attention computation in points to gather global context features, while performing multi-head self-attention (MSA) computation in voxels to capture local information and reduce irregular data access. Additionally, to further reduce the cost of MSA computation, we design a cyclic shifted boxing scheme, which brings greater efficiency by limiting the MSA computation to non-overlapping local boxes while preserving cross-box connections. Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition results. Evaluated on classification and segmentation benchmarks, our PVT not only achieves strong accuracy but also outperforms previous state-of-the-art Transformer-based models with a 9x measured speedup on average. For the 3D object detection task, we replace the primitives in Frustum PointNet with PVT layers and achieve an improvement of 8.6%.
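The abstract does not spell out how the cyclic shifted boxing scheme is implemented, so the following is only a minimal sketch of the general idea under stated assumptions: a cubic voxel grid, boxes of equal size, and a shift of half a box applied with wrap-around. The function names `box_index` and `partition`, and the sizes `GRID` and `BOX`, are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import product


def box_index(voxel, grid_size, box_size, shift=0):
    """Map a voxel coordinate to the index of the local box that contains it.

    The grid is cyclically shifted by `shift` voxels along every axis before
    it is cut into boxes, so voxels near the grid border wrap around.
    """
    return tuple(((c + shift) % grid_size) // box_size for c in voxel)


def partition(grid_size, box_size, shift=0):
    """Group every voxel of a cubic grid into non-overlapping boxes."""
    if grid_size % box_size != 0:
        raise ValueError("grid_size must be a multiple of box_size")
    boxes = defaultdict(list)
    for voxel in product(range(grid_size), repeat=3):
        boxes[box_index(voxel, grid_size, box_size, shift)].append(voxel)
    return dict(boxes)


# Hypothetical sizes, chosen only to keep the example small.
GRID, BOX = 4, 2
regular = partition(GRID, BOX)                  # aligned boxes
shifted = partition(GRID, BOX, shift=BOX // 2)  # boxes after a cyclic shift

# Two neighbouring voxels on opposite sides of a box border.
a, b = (1, 0, 0), (2, 0, 0)
print(box_index(a, GRID, BOX), box_index(b, GRID, BOX))
print(box_index(a, GRID, BOX, BOX // 2), box_index(b, GRID, BOX, BOX // 2))
```

In this sketch, attention restricted to each box would touch 8 voxels at a time instead of all 64, which is where the cost saving comes from. The two example voxels fall into different boxes in the aligned partition but share a box after the shift, which illustrates how alternating the two partitions could preserve cross-box connections.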

updated: Fri Aug 13 2021 06:07:57 GMT+0000 (UTC)

published: Fri Aug 13 2021 06:07:57 GMT+0000 (UTC)

arXiv
