Multi-modal Fusion for Single-Stage Continuous Gesture Recognition

Harshala Gammulle; Simon Denman; Sridha Sridharan; Clinton Fookes

Gesture recognition is a well-studied research area with myriad real-world applications, including robotics and human-machine interaction. Current gesture recognition methods focus on recognising isolated gestures, and existing continuous gesture recognition methods are limited to two-stage approaches in which independent models are required for detection and classification, with the performance of the latter constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism that supports the integration of important information flowing from multi-modal inputs and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features, respectively. To further enhance performance, we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of the proposed framework, which can handle variable-length input videos and outperforms the state-of-the-art on three challenging datasets: EgoGesture, IPN hand, and ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of the different components of the proposed framework.
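
To make the single-stage idea concrete, the sketch below shows one way a sequence of per-frame class predictions could be turned into detected gesture segments without a separate detection model. It is an illustration only, not the TMMF implementation: the assumption that the model emits one label per frame, the `NON_GESTURE` label id, and the function name `frames_to_segments` are all hypothetical and do not come from the paper.

```python
from itertools import groupby
from typing import List, Tuple

# Assumption: label 0 marks "non-gesture" frames (hypothetical convention).
NON_GESTURE = 0


def frames_to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Group per-frame labels into (label, start, end) runs.

    `start` is inclusive and `end` is exclusive, both in frame indices.
    Runs of NON_GESTURE frames are skipped, so the result lists only
    the gestures found in the video, in temporal order.
    """
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        if label != NON_GESTURE:
            segments.append((label, pos, pos + length))
        pos += length
    return segments


if __name__ == "__main__":
    # Eight frames: two idle, three of gesture 3, one idle, two of gesture 5.
    print(frames_to_segments([0, 0, 3, 3, 3, 0, 5, 5]))
```

Under these assumptions, detection and classification both fall out of the same frame-wise output: the boundaries of each run give the detection, and the run's label gives the class.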

updated: Tue Aug 24 2021 06:36:51 GMT+0000 (UTC)

published: Tue Nov 10 2020 07:09:35 GMT+0000 (UTC)

arXiv
