Hypergraph Transformer for Skeleton-based Action Recognition

Yuxuan Zhou; Zhi-Qi Cheng; Chao Li; Yanwen Fang; Yifeng Geng; Xuansong Xie; Margret Keuper

Skeleton-based action recognition aims to recognize human actions from human joint coordinates and their skeletal interconnections. By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional Networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs has been identified: the topology is fixed after training. To relax this restriction, the Self-Attention (SA) mechanism has been adopted to make the topology of GCNs adaptive to the input, resulting in state-of-the-art hybrid models. Concurrently, attempts with plain Transformers have also been made, but they still lag behind state-of-the-art GCN-based methods due to the lack of a structural prior. Unlike hybrid models, we propose a more elegant solution that incorporates bone connectivity into the Transformer via a graph distance embedding. Our embedding retains the information of the skeletal structure during training, whereas GCNs merely use it for initialization. More importantly, we reveal an underlying issue of graph models in general: pairwise aggregation essentially ignores the high-order kinematic dependencies between body joints. To fill this gap, we propose a new self-attention mechanism on hypergraphs, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model. We name the resulting model Hyperformer; it outperforms state-of-the-art graph models in both accuracy and efficiency on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.
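
The abstract does not spell out how the graph distance embedding is computed. As a rough illustration only, the sketch below computes pairwise shortest-path (hop) distances between joints with breadth-first search. The 7-joint skeleton `TOY_BONES` and the function name `graph_distances` are hypothetical and are not taken from the paper or from the NTU RGB+D joint layout.

```python
from collections import deque

# Hypothetical toy skeleton (NOT the NTU RGB+D layout):
# 0 pelvis, 1 chest, 2 head, 3 left elbow, 4 left hand, 5 right elbow, 6 right hand
TOY_BONES = [(0, 1), (1, 2), (1, 3), (3, 4), (1, 5), (5, 6)]


def graph_distances(num_joints, bones):
    """Return a matrix of hop distances between all joint pairs (-1 if unreachable)."""
    # Build an undirected adjacency list from the bone list.
    adjacency = [[] for _ in range(num_joints)]
    for a, b in bones:
        adjacency[a].append(b)
        adjacency[b].append(a)

    dist = [[-1] * num_joints for _ in range(num_joints)]
    for source in range(num_joints):
        # Breadth-first search visits joints in order of increasing hop count.
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[source][v] == -1:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


if __name__ == "__main__":
    for row in graph_distances(7, TOY_BONES):
        print(row)
```

One plausible use, stated here as an assumption, is to treat each integer distance as an index into a learnable embedding table whose values bias the attention between the corresponding pair of joints. Because the distances depend only on the skeleton, that structural information would stay available throughout training.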

updated: Mon Mar 13 2023 05:19:15 GMT+0000 (UTC)

published: Thu Nov 17 2022 15:36:48 GMT+0000 (UTC)

arXiv
