Using Motion History Images with 3D Convolutional Networks in Isolated Sign Language Recognition

Ozge Mercanoglu Sincan; Hacer Yalim Keles

Sign language recognition using computational models is a challenging problem that requires simultaneous spatio-temporal modeling of multiple sources, such as the face, hands, and body. In this paper, we propose an isolated sign language recognition model built on a model trained with Motion History Images (MHI) generated from RGB video frames. An RGB-MHI image effectively summarizes the spatio-temporal content of each sign video in a single RGB image. We propose two different approaches using this RGB-MHI model. In the first approach, we use the RGB-MHI model as a motion-based spatial attention module integrated into a 3D-CNN architecture. In the second approach, we combine the RGB-MHI model features directly with the features of a 3D-CNN model using a late fusion technique. We perform extensive experiments on two recently released large-scale isolated sign language datasets, namely AUTSL and BosphorusSign22k. Our experiments show that our models, which use only RGB data, are competitive with state-of-the-art models in the literature that use multi-modal data.
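
To make the idea of a motion history image concrete, the following is a minimal sketch of the classic MHI update rule on a grayscale clip: pixels that changed in the latest frame are set to a maximum value, and all other pixels decay by one step, so recent motion appears brighter than older motion. The abstract does not specify how the authors build their RGB-MHI images, so this is not their exact construction; the function name `motion_history_image` and the parameters `diff_threshold` and `duration` are assumptions made for illustration.

```python
import numpy as np


def motion_history_image(frames, diff_threshold=30, duration=None):
    """Classic motion history image for a grayscale clip (illustrative sketch).

    frames: array-like of shape (T, H, W) with intensities in [0, 255].
    diff_threshold: assumed threshold on the absolute frame difference.
    duration: decay length tau in frames; defaults to T - 1 (an assumption).
    Returns a uint8 image of shape (H, W); brighter means more recent motion.
    """
    # Signed dtype so that frame differences do not wrap around.
    frames = np.asarray(frames, dtype=np.int16)
    num_frames = frames.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames to measure motion")
    tau = duration if duration is not None else num_frames - 1

    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, num_frames):
        # Binary motion mask from simple frame differencing.
        moving = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        # Moving pixels are reset to tau; all others decay by one, floored at 0.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))

    # Rescale to [0, 255] so the result can be stored as an ordinary image.
    return (255.0 * mhi / tau).astype(np.uint8)


# Toy clip: one bright pixel moving left to right along row 2.
clip = np.zeros((4, 5, 5), dtype=np.uint8)
for t in range(4):
    clip[t, 2, t] = 255
summary = motion_history_image(clip)
```

In `summary`, the pixels along the path get brighter toward the most recent position, which is what lets a single image summarize the motion of a whole clip. A per-channel or color-coded variant would be needed to obtain an RGB image; that step is left out here because the abstract does not describe it.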

updated: Fri Feb 18 2022 15:38:54 GMT+0000 (UTC)

published: Sun Oct 24 2021 09:25:28 GMT+0000 (UTC)

arXiv
