PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

Neel Trivedi; Ravi Kiran Sarvadevabhatla

Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e., the joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part joint groups, such as the hands (e.g., 'Thumbs up') or the legs (e.g., 'Kicking'). Although part-grouping based approaches exist, they do not consider each part group within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g., joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and the dense joint skeleton datasets NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods that use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices. Code and pretrained models can be accessed at https://github.com/skelemoa/psumnet
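To make the idea of unifying modalities within a part stream more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the definitions commonly used in skeleton-based action recognition: a bone is a joint's position minus its parent's position, and a velocity is the difference between consecutive frames. The toy 5-joint skeleton, the `PARENTS` table, the function names, and the choice of part joints are all hypothetical and are not taken from the paper or its repository.

```python
# Hypothetical toy skeleton: parent index of each joint (the root is its own parent).
PARENTS = [0, 0, 1, 0, 3]


def bones(frame, parents):
    """Bone vector per joint: joint position minus its parent's position."""
    return [
        tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
        for j in range(len(frame))
    ]


def velocity(seq):
    """Frame-to-frame difference; the last frame is padded with zeros."""
    out = []
    for t in range(len(seq)):
        if t + 1 < len(seq):
            out.append([
                tuple(b - a for a, b in zip(cur, nxt))
                for cur, nxt in zip(seq[t], seq[t + 1])
            ])
        else:
            out.append([tuple(0.0 for _ in joint) for joint in seq[t]])
    return out


def unify_modalities(joint_seq, parents):
    """Concatenate joint, bone, joint velocity and bone velocity per joint."""
    bone_seq = [bones(frame, parents) for frame in joint_seq]
    joint_vel = velocity(joint_seq)
    bone_vel = velocity(bone_seq)
    return [
        [j + b + jv + bv for j, b, jv, bv in zip(fj, fb, fjv, fbv)]
        for fj, fb, fjv, fbv in zip(joint_seq, bone_seq, joint_vel, bone_vel)
    ]


def part_stream(unified, part_joints):
    """Select one part's joints; coordinates stay in the global pose frame."""
    return [[frame[j] for j in part_joints] for frame in unified]


# Two frames of 3D joints; joint 2 moves along z between the frames.
sequence = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 1.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
unified = unify_modalities(sequence, PARENTS)
arm = part_stream(unified, [0, 1, 2])  # hypothetical "arm" part group
```

In this sketch each joint carries a single 12-dimensional feature instead of four separate 3-dimensional inputs, so one network per part could consume all four modalities at once rather than being trained once per modality. How PSUMNet actually combines the modalities and defines its part groups is described in the paper and the linked repository.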

updated: Thu Aug 11 2022 12:12:07 GMT+0000 (UTC)

published: Thu Aug 11 2022 12:12:07 GMT+0000 (UTC)

arXiv
