SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training

Hong Yan; Yang Liu; Yushen Wei; Zhen Li; Guanbin Li; Liang Lin

Skeleton sequence representation learning has shown great advantages for action recognition because it can model human joints and body topology. However, current methods usually require sufficient labeled data to train computationally expensive models, and annotating that data is labor-intensive and time-consuming. Moreover, these methods do not address how to use the fine-grained dependencies among different skeleton joints to pre-train an efficient skeleton sequence learning model that generalizes well across different datasets. In this paper, we propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL). To comprehensively capture the human pose and obtain discriminative skeleton sequence representations, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE, which embeds the skeleton joint sequence into a Graph Convolutional Network (GCN) and reconstructs the masked skeleton joints and edges based on prior knowledge of human topology. Then, the pre-trained SkeletonMAE encoder is integrated with the Spatial-Temporal Representation Learning (STRL) module to build the SSL framework. Extensive experimental results show that our SSL generalizes well across different datasets and outperforms state-of-the-art self-supervised skeleton-based action recognition methods on the FineGym, Diving48, NTU 60 and NTU 120 datasets. Additionally, we obtain performance comparable to some fully supervised methods. The code is available at https://github.com/HongYan1123/SkeletonMAE.
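
The masking-and-reconstruction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 5-joint toy skeleton, the zero mask token, the function names, and the single weight-free propagation step (a stand-in for a trained GCN encoder and decoder) are all assumptions made for illustration. It only shows how joints of a skeleton graph might be masked, how features are aggregated over the symmetrically normalized adjacency with self-loops, and how a reconstruction error restricted to the masked joints could be computed.

```python
import random

# Hypothetical 5-joint toy skeleton (an assumption, not a dataset layout):
# 0 = torso, 1 = head, 2 = left hand, 3 = right hand, 4 = feet
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4)]
NUM_JOINTS = 5


def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list of lists."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5)
             for j in range(num_joints)]
            for i in range(num_joints)]


def mask_joints(features, masked, mask_token):
    """Replace the feature vector of every masked joint with mask_token."""
    return [list(mask_token) if j in masked else list(f)
            for j, f in enumerate(features)]


def propagate(adj, features):
    """One weight-free graph-convolution step: H = A_hat @ X."""
    n, dim = len(features), len(features[0])
    return [[sum(adj[i][k] * features[k][d] for k in range(n))
             for d in range(dim)]
            for i in range(n)]


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked joints only."""
    errs = [(p - t) ** 2
            for j in masked
            for p, t in zip(pred[j], target[j])]
    return sum(errs) / len(errs)


# 2D joint coordinates for a single frame (made-up values).
joints = [[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [1.0, 0.5], [0.0, -1.0]]

rng = random.Random(0)
masked = set(rng.sample(range(NUM_JOINTS), 2))  # mask 2 of the 5 joints

corrupted = mask_joints(joints, masked, mask_token=[0.0, 0.0])
adj = normalized_adjacency(NUM_JOINTS, EDGES)
reconstruction = propagate(adj, corrupted)
loss = masked_mse(reconstruction, joints, masked)
print("masked joints:", sorted(masked), "reconstruction loss:", loss)
```

In a trained model the propagation step would carry learnable weights and the loss would drive pre-training; here it only makes the data flow concrete.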

updated: Mon Jul 17 2023 13:33:11 GMT+0000 (UTC)

published: Mon Jul 17 2023 13:33:11 GMT+0000 (UTC)

arXiv
