Part Aware Contrastive Learning for Self-Supervised Action Recognition

Yilei Hua; Wenhan Wu; Ce Zheng; Aidong Lu; Mengyuan Liu; Chen Chen; Shiqian Wu

In recent years, contrastive learning on skeleton sequences has achieved remarkable results in self-supervised action recognition. The semantic distinctions between human actions are often carried by local body parts, such as the legs or hands, a property that benefits skeleton-based action recognition. This paper proposes SkeAttnCLR, an attention-based contrastive learning framework for skeleton representation learning that integrates local similarity and global features for skeleton-based action representations. To achieve this, a multi-head attention mask module learns soft attention mask features from the skeletons, suppressing non-salient local features while accentuating salient local features, thereby bringing similar local features closer in the feature space. Additionally, ample contrastive pairs are generated by combining salient and non-salient features with global features, which guides the network to learn semantic representations of the entire skeleton. With the attention mask mechanism, SkeAttnCLR thus learns local features under different data augmentation views. Experimental results demonstrate that including local feature similarity significantly enhances skeleton-based action representation. The proposed SkeAttnCLR outperforms state-of-the-art methods on the NTURGB+D, NTU120-RGB+D, and PKU-MMD datasets.
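
The idea of splitting features with a soft attention mask and contrasting the resulting parts can be illustrated with a minimal sketch. This is not the SkeAttnCLR implementation: the function names (`split_by_soft_mask`, `info_nce`), the sigmoid mask, the cosine-similarity InfoNCE loss, and the temperature of 0.07 are all assumptions chosen for illustration, and the abstract does not specify how the mask is computed or exactly which pairs are contrasted.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a soft mask weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def split_by_soft_mask(part_features, part_scores):
    """Pool per-part features into a salient and a non-salient vector.

    part_features: list of feature vectors, one per body part (hypothetical layout).
    part_scores:   one raw attention score per part (assumed to come from a learned module).
    """
    mask = [sigmoid(s) for s in part_scores]
    dim = len(part_features[0])
    salient = [0.0] * dim
    non_salient = [0.0] * dim
    for weight, feature in zip(mask, part_features):
        for d, value in enumerate(feature):
            salient[d] += weight * value              # emphasized by the mask
            non_salient[d] += (1.0 - weight) * value  # emphasized by the complement
    mask_sum = sum(mask)
    complement_sum = len(mask) - mask_sum
    salient = [v / mask_sum for v in salient]
    non_salient = [v / complement_sum for v in non_salient]
    return salient, non_salient


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.07):
    """Standard InfoNCE loss for one query, one positive, and a list of negatives."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Two toy "body parts" with 2-D features; the first part receives the higher score.
features = [[1.0, 0.0], [0.0, 1.0]]
scores = [4.0, -4.0]
salient, non_salient = split_by_soft_mask(features, scores)

# One plausible pairing: a salient feature should match another salient feature
# and be pushed away from the non-salient one.
loss = info_nce(salient, salient, [non_salient])
```

In this toy setup the loss is low when the salient feature is paired with a salient positive and high when the roles are swapped, which is the behavior a contrastive objective over salient and non-salient features would be expected to reward.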

updated: Thu May 11 2023 07:26:18 GMT+0000 (UTC)

published: Mon May 01 2023 05:31:48 GMT+0000 (UTC)

arXiv
