A Deep Learning Framework for Recognizing both Static and Dynamic Gestures

Osama Mazhar; Sofiane Ramdani; Andrea Cherubini

Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures using simple RGB vision (without depth sensing). This makes it suitable for inexpensive human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network (StaDNet). From the image of the human upper body, we estimate the person's depth, along with the regions of interest around the hands. The Convolutional Neural Network in StaDNet is fine-tuned on a background-substituted hand gestures dataset. It is used to detect 10 static gestures for each hand as well as to obtain the hand image embeddings. These are subsequently fused with the augmented pose vector and then passed to stacked Long Short-Term Memory blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left/right hand image embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also surpass the state of the art on this dataset.
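The fusion and temporal aggregation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the vector sizes, the random weights, and the names `fuse_frame`, `LSTMCell`, and `run_stacked_lstm` are all hypothetical, and the abstract does not state the real dimensions or the classification layer that would follow. The sketch only shows the data flow, in which each frame's augmented pose vector is concatenated with the left and right hand embeddings and the resulting sequence is passed through a stack of LSTM cells.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_frame(pose, left_emb, right_emb):
    """Concatenate the augmented pose vector with both hand embeddings."""
    return list(pose) + list(left_emb) + list(right_emb)


class LSTMCell:
    """Plain LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, g = cell candidate, o = output.
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifgo"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def step(self, x, h, c):
        z = list(x) + list(h)

        def gate(name, activation):
            return [
                activation(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(self.w[name], self.b[name])
            ]

        i = gate("i", sigmoid)
        f = gate("f", sigmoid)
        g = gate("g", math.tanh)
        o = gate("o", sigmoid)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run_stacked_lstm(frames, cells):
    """Feed fused frames through stacked cells; return the last hidden state."""
    states = [([0.0] * cell.hidden_size, [0.0] * cell.hidden_size)
              for cell in cells]
    for x in frames:
        for k, cell in enumerate(cells):
            h, c = cell.step(x, *states[k])
            states[k] = (h, c)
            x = h  # the output of one layer is the input of the next
    return states[-1][0]


# Hypothetical sizes, chosen only to keep the example small.
POSE_DIM, EMB_DIM, HIDDEN, NUM_FRAMES = 6, 8, 5, 4
rng = random.Random(0)
cells = [
    LSTMCell(POSE_DIM + 2 * EMB_DIM, HIDDEN, rng),
    LSTMCell(HIDDEN, HIDDEN, rng),
]
frames = [
    fuse_frame(
        [rng.random() for _ in range(POSE_DIM)],  # augmented pose vector
        [rng.random() for _ in range(EMB_DIM)],   # left-hand embedding
        [rng.random() for _ in range(EMB_DIM)],   # right-hand embedding
    )
    for _ in range(NUM_FRAMES)
]
summary = run_stacked_lstm(frames, cells)
```

In a trained system, `summary` would feed a classifier over the dynamic gesture classes. Here the weights are random, so the values only demonstrate the shapes involved.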

updated: Wed Mar 17 2021 10:31:16 GMT+0000 (UTC)

published: Thu Jun 11 2020 10:39:02 GMT+0000 (UTC)

arXiv
