Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks

Chiara Plizzari; Marco Cannici; Matteo Matteucci

Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data has been shown to be robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting useful information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR), which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to capture intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets: NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400. On NTU-RGB+D, it outperforms state-of-the-art models that use the same input data, i.e., joint information.
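
The difference between the two attention modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product attention, random projection weights, and a skeleton sequence stored as nested lists of shape (frames, joints, channels). All function names, the weight matrices, and the sizes `T`, `V`, `C`, `D` are hypothetical and chosen only for illustration. Spatial attention lets the joints of one frame attend to each other; temporal attention lets one joint attend to itself across frames.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def project(tokens, weight):
    """Multiply an (N x C_in) token list by a (C_in x C_out) weight matrix."""
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(weight)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of tokens."""
    q, k, v = project(tokens, w_q), project(tokens, w_k), project(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        weights = softmax(scores)
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def spatial_self_attention(skel, w_q, w_k, w_v):
    """Intra-frame: the tokens are the joints of a single frame."""
    return [self_attention(frame, w_q, w_k, w_v) for frame in skel]


def temporal_self_attention(skel, w_q, w_k, w_v):
    """Inter-frame: the tokens are one joint's features across all frames."""
    num_joints = len(skel[0])
    per_joint = [
        self_attention([frame[j] for frame in skel], w_q, w_k, w_v)
        for j in range(num_joints)
    ]
    # Rearrange from (joints, frames, D) back to (frames, joints, D).
    return [[per_joint[j][t] for j in range(num_joints)]
            for t in range(len(skel))]


# Hypothetical sizes: 8 frames, 25 joints, 3D coordinates, 16 output channels.
T, V, C, D = 8, 25, 3, 16
rng = random.Random(0)
skel = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(V)] for _ in range(T)]
w_q, w_k, w_v = (
    [[rng.gauss(0, 1) for _ in range(D)] for _ in range(C)] for _ in range(3)
)

ssa_out = spatial_self_attention(skel, w_q, w_k, w_v)
tsa_out = temporal_self_attention(skel, w_q, w_k, w_v)
print(len(ssa_out), len(ssa_out[0]), len(ssa_out[0][0]))
print(len(tsa_out), len(tsa_out[0]), len(tsa_out[0][0]))
```

Both functions return a tensor with the same frame and joint layout as the input, so in this sketch their outputs could be fed to separate streams and fused afterwards. How the actual ST-TR streams are built and combined is not specified in the abstract.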

updated: Fri Dec 11 2020 14:49:47 GMT+0000 (UTC)

published: Mon Aug 17 2020 15:25:40 GMT+0000 (UTC)

arXiv
