Skeleton Aware Multi-modal Sign Language Recognition

Songyao Jiang; Bin Sun; Lichen Wang; Yue Bai; Kunpeng Li; Yun Fu

Sign language is commonly used by deaf or speech-impaired people to communicate, but it requires significant effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing signs from given videos. It is an essential yet challenging task, since sign language is performed with fast and complex movements of hand gestures, body posture, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention because skeleton data is independent of subject and background variation. However, skeleton-based SLR is still under exploration due to the lack of annotations on hand keypoints. Some efforts have been made to use hand detectors with pose estimators to extract hand keypoints and learn to recognize sign language via neural networks, but none of them outperforms RGB-based methods. To this end, we propose a novel Skeleton Aware Multi-modal SLR framework (SAM-SLR) to take advantage of multi-modal information towards a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics and a novel Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. RGB and depth modalities are also incorporated and assembled into our framework to provide global information that is complementary to the skeleton-based methods SL-GCN and SSTCN. As a result, SAM-SLR achieves the highest performance in both the RGB (98.42%) and RGB-D (98.53%) tracks of the 2021 Looking at People Large Scale Signer Independent Isolated SLR Challenge. Our code is available at https://github.com/jackyjsy/CVPR21Chal-SLR
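
The abstract says that skeleton, RGB, and depth modalities are assembled into one framework, but it does not say how their outputs are combined. The sketch below illustrates one common option, a weighted sum of per-modality class scores (late fusion). This is an assumption made for illustration, not the authors' confirmed method. The function names, modality keys, scores, and weights are all hypothetical.

```python
# Minimal sketch of weighted late fusion over per-modality class scores.
# Assumption: each modality model outputs one score per sign class.
# All names and numbers here are hypothetical, not taken from the paper.

def fuse_scores(scores_by_modality, weights):
    """Return the weighted sum of class scores across modalities."""
    num_classes = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * num_classes
    for name, scores in scores_by_modality.items():
        if len(scores) != num_classes:
            raise ValueError(f"modality {name!r} has a different class count")
        weight = weights[name]
        for class_index, score in enumerate(scores):
            fused[class_index] += weight * score
    return fused


def predict(scores_by_modality, weights):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(scores_by_modality, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three sign classes from four modality models.
example_scores = {
    "skeleton_gcn": [0.1, 0.7, 0.2],
    "skeleton_cnn": [0.2, 0.5, 0.3],
    "rgb": [0.6, 0.3, 0.1],
    "depth": [0.3, 0.3, 0.4],
}
equal_weights = {name: 1.0 for name in example_scores}
predicted_class = predict(example_scores, equal_weights)
```

With equal weights, the two skeleton models outvote the RGB model in this toy example. Raising the weight of one modality shifts the decision towards it, which is why such weights are usually tuned on a validation set.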

updated: Sun May 02 2021 20:49:40 GMT+0000 (UTC)

published: Tue Mar 16 2021 03:38:17 GMT+0000 (UTC)

arXiv
