Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition

Zhen Huang; Xu Shen; Xinmei Tian; Houqiang Li; Jianqiang Huang; Xian-Sheng Hua

Skeleton-based human action recognition has attracted much attention with the prevalence of accessible depth sensors. Recently, graph convolutional networks (GCNs) have been widely used for this task due to their powerful capability to model graph data. The topology of the adjacency graph is a key factor for modeling the correlations of the input skeletons. Thus, previous methods mainly focus on the design/learning of the graph topology. But once the topology is learned, only a single-scale feature and one transformation exist in each layer of the networks. Many insights, such as multi-scale information and multiple sets of transformations, that have been proven to be very effective in convolutional neural networks (CNNs), have not been investigated in GCNs. The reason is that, due to the gap between graph-structured skeleton data and conventional image/video data, it is very challenging to embed these insights into GCNs. To overcome this gap, we reinvent the split-transform-merge strategy in GCNs for skeleton sequence processing. Specifically, we design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition. Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths. Extensive experiments demonstrate that our network outperforms state-of-the-art methods by a significant margin with only 1/5 of the parameters and 1/10 of the FLOPs. Code is available at https://github.com/yellowtownhz/STIGCN.
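The split-transform-merge idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal pure-Python illustration that assumes one plausible reading of "multi-scale" on the spatial path: each branch aggregates joint features over a k-hop neighborhood, applies its own linear transformation, and the branch outputs are concatenated. The function names, the k-hop construction, the row normalization, and the toy skeleton are all hypothetical choices made for illustration, and the temporal path is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(a):
    """Scale each row to sum to 1 so aggregation averages over neighbors."""
    out = []
    for row in a:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def k_hop_adjacency(adj, k):
    """Binary matrix marking joints reachable within k hops (self included)."""
    n = len(adj)
    with_self = [[1.0 if (i == j or adj[i][j]) else 0.0 for j in range(n)]
                 for i in range(n)]
    reach = with_self
    for _ in range(k - 1):
        prod = matmul(reach, with_self)
        reach = [[1.0 if v > 0 else 0.0 for v in row] for row in prod]
    return reach


def split_transform_merge(x, adj, weights):
    """Hypothetical multi-scale spatial block.

    x:       V x C matrix of per-joint features.
    adj:     V x V binary skeleton adjacency (no self-loops needed).
    weights: list of C x C_k matrices; weights[k-1] belongs to the k-hop branch.
    """
    branches = []
    for k, w in enumerate(weights, start=1):
        a_k = row_normalize(k_hop_adjacency(adj, k))  # split: one graph scale per branch
        branches.append(matmul(matmul(a_k, x), w))    # transform: aggregate, then project
    # merge: concatenate the branch outputs along the channel axis
    return [sum((b[v] for b in branches), []) for v in range(len(x))]


# Toy example (assumed data): a 3-joint chain 0-1-2 with one input channel.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
x = [[1.0], [2.0], [3.0]]
weights = [[[1.0]], [[1.0]]]  # two branches (1-hop and 2-hop), identity projections
out = split_transform_merge(x, adj, weights)
print(out)  # 3 joints, 2 output channels (one per branch)
```

In this toy case the 1-hop branch averages each joint with its direct neighbors, while the 2-hop branch sees the whole chain, so the two output channels carry information at different spatial granularities. A real block would use learned weights, batched tensors, and an analogous multi-scale path along the time axis.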

updated: Fri Aug 20 2021 02:14:37 GMT+0000 (UTC)

published: Thu Nov 26 2020 14:43:04 GMT+0000 (UTC)

arXiv
