Natural Language-Assisted Sign Language Recognition

Ronglai Zuo; Fangyun Wei; Brian Mak

Sign languages are visual languages that convey information through the signer's handshape, facial expression, body movement, and so forth. Because these visual ingredients can be combined in only a limited number of ways, sign languages contain a significant number of visually indistinguishable signs (VISigns), which limits the recognition capacity of vision neural networks. To mitigate this problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits the semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing to ease training: it generates a soft label for each training sign, with smoothing weights computed from the normalized semantic similarities among the glosses. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique that blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. We also introduce a novel backbone, the video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields. Empirically, our method achieves state-of-the-art performance on three widely adopted benchmarks: MSASL, WLASL, and NMFs-CSL. Code is available at https://github.com/FangyunWei/SLRT.
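
The two label-side ideas in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the cosine similarity, the softmax normalization with a `temperature`, the smoothing budget `epsilon`, and the mixing ratio `lam` are all assumptions chosen for illustration, and the gloss embeddings are hypothetical placeholders for whatever word vectors are used in practice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_aware_soft_label(target, gloss_embeddings, epsilon=0.2, temperature=0.5):
    """Build a soft label for class `target` (illustrative sketch).

    The target keeps 1 - epsilon of the probability mass. The remaining
    epsilon is spread over the other classes in proportion to a softmax
    of their gloss similarity to the target (an assumed normalization).
    """
    if len(gloss_embeddings) < 2:
        raise ValueError("need at least two glosses to smooth over")
    sims = [cosine(gloss_embeddings[target], e) for e in gloss_embeddings]
    # Softmax over non-target classes only; the target's weight is set below.
    exps = [0.0 if i == target else math.exp(s / temperature)
            for i, s in enumerate(sims)]
    total = sum(exps)
    label = [epsilon * x / total for x in exps]
    label[target] = 1.0 - epsilon
    return label


def inter_modality_mixup(vision_feat, gloss_feat, vision_label, gloss_label,
                         num_classes, lam):
    """Blend a vision feature with a gloss feature and blend their labels.

    `lam` weights the vision side and 1 - lam the gloss side (assumed form).
    """
    mixed_feat = [lam * v + (1.0 - lam) * g
                  for v, g in zip(vision_feat, gloss_feat)]
    mixed_label = [0.0] * num_classes
    mixed_label[vision_label] += lam
    mixed_label[gloss_label] += 1.0 - lam
    return mixed_feat, mixed_label


if __name__ == "__main__":
    # Hypothetical 2-D gloss embeddings: classes 0 and 1 are semantically close.
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(language_aware_soft_label(0, embeddings))
    print(inter_modality_mixup([1.0, 0.0], [0.0, 1.0], 0, 2, 3, 0.75))
```

In this sketch a semantically close gloss receives more of the smoothing mass than a distant one, whereas standard label smoothing would spread `epsilon` uniformly over all non-target classes.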

updated: Tue Mar 21 2023 17:59:57 GMT+0000 (UTC)

published: Tue Mar 21 2023 17:59:57 GMT+0000 (UTC)

arXiv
