FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer

Zhijian Liu; Xinyu Yang; Haotian Tang; Shang Yang; Song Han

Transformers, as an alternative to CNNs, have proven effective in many modalities (e.g., text and images). For 3D point cloud transformers, existing efforts focus primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind sparse convolution-based models (3x slower), hindering their usage in resource-constrained, latency-sensitive applications (such as autonomous driving). This inefficiency comes from point clouds' sparse and irregular nature, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer to close this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition points into groups of equal sizes rather than windows of equal shapes. This effectively avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on the Waymo Open Dataset with 4.6x speedup over (transformer-based) SST and 1.4x speedup over (sparse convolutional) CenterPoint. This is the first point cloud transformer that achieves real-time performance on edge GPUs and is faster than sparse convolutional methods while achieving on-par or even superior accuracy on large-scale benchmarks.
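
The flattening step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flatten_and_group`, its parameters, the 2D points, the exact sort key, and the half-window shift are all assumptions made for illustration. It shows only the idea of sorting points by window first and then cutting the sorted sequence into groups of equal size, so that group size stays fixed regardless of how many points fall in each window.

```python
import math


def flatten_and_group(points, window=(4.0, 4.0), group_size=3,
                      axis="x", shift=False):
    """Hypothetical sketch of window-based sorting plus equal-size grouping.

    points: list of (x, y) tuples (2D for simplicity; an assumption).
    Returns a list of groups, each a list of point indices.
    """
    wx, wy = window
    # Assumed shift scheme: offset the window grid by half a window.
    ox, oy = (wx / 2.0, wy / 2.0) if shift else (0.0, 0.0)

    def sort_key(i):
        x, y = points[i]
        # Quantize coordinates to window indices.
        cx = math.floor((x + ox) / wx)
        cy = math.floor((y + oy) / wy)
        # Window index comes first, so points in the same window stay
        # adjacent; the in-window order is an assumed tie-break.
        if axis == "x":
            return (cx, cy, x, y)
        return (cy, cx, y, x)

    order = sorted(range(len(points)), key=sort_key)
    # Cut the flattened sequence into groups of equal size. In this
    # sketch the final group may be shorter than group_size.
    return [order[i:i + group_size]
            for i in range(0, len(order), group_size)]


if __name__ == "__main__":
    pts = [(0.5, 0.5), (5.0, 0.2), (1.0, 3.0), (0.1, 6.0),
           (4.5, 4.5), (7.9, 7.9), (2.0, 2.0)]
    print(flatten_and_group(pts, axis="x"))
    print(flatten_and_group(pts, axis="y"))
```

Calling the function with `axis="x"` and then `axis="y"` mimics alternating the sorting axis, and `shift=True` mimics shifting windows; self-attention would then be applied independently within each returned group. A group may span a window boundary, which is the spatial proximity being traded for regular, padding-free computation.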

updated: Fri Jul 14 2023 18:57:30 GMT+0000 (UTC)

published: Fri Jan 20 2023 18:59:57 GMT+0000 (UTC)

arXiv
