A Two-stream Neural Network for Pose-based Hand Gesture Recognition

Chuankun Li; Shuai Li; Yanbo Gao; Xiang Zhang; Wanqing Li

Pose-based hand gesture recognition has been widely studied in recent years. Compared with full-body action recognition, hand gestures involve joints that are more closely distributed in space and collaborate more strongly. Capturing these complex spatial features therefore requires a different approach from that used for action recognition. In addition, many gesture categories, such as "Grab" and "Pinch", have very similar motion or temporal patterns, which poses a challenge for temporal processing. To address these challenges, this paper proposes a two-stream neural network. One stream is a self-attention based graph convolutional network (SAGCN) that extracts short-term temporal information and hierarchical spatial information; the other is a residual-connection enhanced bidirectional Independently Recurrent Neural Network (RBi-IndRNN) that extracts long-term temporal information. The SAGCN adds a dynamic self-attention mechanism that adaptively exploits the relationships among all hand joints, complementing the fixed topology and local feature extraction of the GCN. The RBi-IndRNN extends the IndRNN with bidirectional processing for temporal modelling. The two streams are fused for recognition. The method is validated on the Dynamic Hand Gesture dataset and the First-Person Hand Action dataset, where it achieves state-of-the-art performance.
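
As a rough illustration of the recurrent stream only, the sketch below implements a single IndRNN layer, in which each hidden unit has its own scalar recurrent weight, and wraps it in a bidirectional pass with an identity skip connection. It is a minimal, dependency-free sketch and not the authors' implementation: the function names, the choice to sum the two directions rather than concatenate them, and the placement of the residual connection are all assumptions, since the abstract does not specify them.

```python
def indrnn(xs, W, u, b):
    """Run one IndRNN layer over a sequence.

    xs: list of T input vectors, each of length d_in
    W:  d_hid x d_in input weights (list of rows)
    u:  d_hid recurrent weights, one scalar per hidden unit
    b:  d_hid biases
    Returns a list of T hidden vectors.
    """
    h = [0.0] * len(u)
    out = []
    for x in xs:
        # h_t = relu(W x_t + u * h_{t-1} + b); the recurrent term is element-wise
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + ui * hi + bi)
             for row, ui, hi, bi in zip(W, u, h, b)]
        out.append(h)
    return out


def residual_bi_indrnn(xs, fwd, bwd):
    """Bidirectional IndRNN with an identity skip connection (assumed form)."""
    f = indrnn(xs, *fwd)
    r = indrnn(xs[::-1], *bwd)[::-1]  # process reversed, then re-align in time
    # Assumption: sum both directions and add the input back (needs d_hid == d_in).
    return [[xi + fi + ri for xi, fi, ri in zip(x, hf, hb)]
            for x, hf, hb in zip(xs, f, r)]


# Toy usage: 3 time steps, 2 features, identity input weights (hypothetical values).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.5, 0.5], [0.0, 0.0])
ys = residual_bi_indrnn(xs, params, params)
print(ys)
```

Because the recurrent weight is a per-unit scalar rather than a full matrix, each hidden unit evolves independently over time, which is the property that gives the IndRNN its name.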

updated: Fri Jan 22 2021 03:22:26 GMT+0000 (UTC)

published: Fri Jan 22 2021 03:22:26 GMT+0000 (UTC)

arXiv
