Spatial-Temporal Transformer for 3D Point Cloud Sequences

Yimin Wei; Hao Liu; Tingting Xie; Qiuhong Ke; Yulan Guo

Effective learning of spatial-temporal information within a point cloud sequence is highly important for many downstream tasks such as 4D semantic segmentation and 3D action recognition. In this paper, we propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations from dynamic 3D point cloud sequences. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. The STSA module is introduced to capture spatial-temporal context information across adjacent frames, while the RE module is proposed to aggregate features across neighbors to enhance the resolution of feature maps. We test the effectiveness of our PST2 on two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition. Extensive experiments on three benchmarks show that our PST2 outperforms existing methods on all datasets. The effectiveness of our STSA and RE modules has also been justified with ablation experiments.
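To make the idea of self-attention across adjacent frames concrete, the following is a minimal sketch rather than the STSA module itself, whose details the abstract does not give. It assumes identity query/key/value projections (no learned weights), and the function names `softmax` and `attend_across_frames` and the toy features are hypothetical. Every point in a window of frames attends to every point in the same window, so the output for a point can draw on points from neighboring frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_across_frames(frames):
    """Scaled dot-product self-attention over all points in a window of frames.

    frames: list of frames; each frame is a list of per-point feature vectors.
    Assumption: queries, keys and values are the raw features (identity
    projections), which a trained model would replace with learned ones.
    Returns (outputs, weights), one row per point across all frames.
    """
    # Flatten the window so points from different frames can attend to each other.
    tokens = [feat for frame in frames for feat in frame]
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        out = [sum(w_j * v[d] for w_j, v in zip(w, tokens)) for d in range(dim)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Two adjacent frames with two points each, using 2-D toy features.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
]
outputs, weights = attend_across_frames(frames)
```

In this toy window, the first point of frame 0 puts more weight on the similar point in frame 1 than on the dissimilar point in its own frame, which is the kind of cross-frame context the abstract attributes to attention over adjacent frames.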

updated: Tue Oct 19 2021 07:55:47 GMT+0000 (UTC)

published: Tue Oct 19 2021 07:55:47 GMT+0000 (UTC)

arXiv
