Self-positioning Point-based Transformer for Point Cloud Understanding

Jinyoung Park; Sanghyeok Lee; Sihyeon Kim; Yunyang Xiong; Hyunwoo J. Kim

Transformers have shown superior performance on various computer vision tasks thanks to their ability to capture long-range dependencies. Despite this success, it is challenging to apply Transformers directly to point clouds because their cost is quadratic in the number of points. In this paper, we present the Self-Positioning point-based Transformer (SPoTr), which is designed to capture both local and global shape contexts with reduced complexity. Specifically, the architecture consists of local self-attention and self-positioning point-based global cross-attention. The self-positioning points, adaptively located based on the input shape, consider both spatial and semantic information with disentangled attention to improve expressive power. With the self-positioning points, we propose a novel global cross-attention mechanism for point clouds, which improves the scalability of global self-attention by allowing the attention module to compute attention weights with only a small set of self-positioning points. Experiments show the effectiveness of SPoTr on three point cloud tasks: shape classification, part segmentation, and scene segmentation. In particular, our proposed model achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN. We also provide qualitative analyses to demonstrate the interpretability of self-positioning points. The code of SPoTr is available at https://github.com/mlvlab/SPoTr.
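To make the scalability argument concrete, the following is a minimal NumPy sketch of cross-attention through a small set of anchor points. It is not the SPoTr implementation (see the repository linked above for that); the abstract gives no formulas, so every detail here is an assumption made for illustration: the function name `anchor_cross_attention`, the `latents` array standing in for learned per-anchor parameters, the Gaussian spatial kernel, and the way the spatial and semantic terms are multiplied. The point it illustrates is only that with N points and M anchors the attention matrix has shape N x M rather than N x N.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def anchor_cross_attention(xyz, feats, latents):
    """Illustrative sketch only, not the authors' method.

    xyz:     (N, 3) point coordinates
    feats:   (N, C) point features
    latents: (M, C) hypothetical per-anchor vectors (learned in a real model)
    Returns updated features (N, C), anchor positions (M, 3),
    and the point-to-anchor attention matrix (N, M).
    """
    c = feats.shape[1]
    scale = 1.0 / np.sqrt(c)

    # Semantic affinity between each anchor latent and each point feature.
    semantic = softmax(latents @ feats.T * scale, axis=1)        # (M, N)

    # Assumption: each anchor is placed at a convex combination of the
    # input points, so its position adapts to the input shape.
    anchor_xyz = semantic @ xyz                                  # (M, 3)

    # Spatial affinity: Gaussian kernel on squared anchor-to-point distance.
    diff = anchor_xyz[:, None, :] - xyz[None, :, :]              # (M, N, 3)
    spatial = np.exp(-(diff ** 2).sum(axis=-1))                  # (M, N)

    # Assumption: keep the spatial and semantic terms separate and combine
    # them by a product, then renormalize over the points.
    weights = spatial * semantic
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-12)
    anchor_feats = weights @ feats                               # (M, C)

    # Global cross-attention: every point attends to the M anchors only,
    # so the attention matrix is (N, M) instead of (N, N).
    attn = softmax(feats @ anchor_feats.T * scale, axis=1)       # (N, M)
    out = attn @ anchor_feats                                    # (N, C)
    return out, anchor_xyz, attn
```

With, say, N = 100 points and M = 8 anchors, the attention matrix holds 800 weights instead of the 10,000 that full self-attention over the points would need, and the gap widens linearly in N for a fixed M.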

updated: Wed Mar 29 2023 04:27:11 GMT+0000 (UTC)

published: Wed Mar 29 2023 04:27:11 GMT+0000 (UTC)

arXiv

