Modelling Lips-State Detection Using CNN for Non-Verbal Communications

Abtahi Ishmam; Mahmudul Hasan; Md. Saif Hassan Onim; Koushik Roy; Md. Akiful Haque Akif; Hossain Nyeem


Vision-based deep learning models are promising for communication by people with speech and hearing impairments and for covert communication. While such non-verbal communication has mainly been studied through hand gestures and facial expressions, no research effort has so far addressed an interpretation/translation system based on the lips state (i.e., open/closed). In support of this development, this paper reports two new Convolutional Neural Network (CNN) models for lips-state detection. Building upon two prominent lips landmark detectors, DLIB and MediaPipe, we simplify the lips-state model to a set of six key landmarks and use the distances between them for lips-state classification. Both models are developed to count the openings and closings of the lips and can thus classify a symbol from the total count. Varying frame rates, lip movements and face angles are investigated to determine the effectiveness of the models. Our early experimental results show that the model with DLIB is relatively slow, averaging 6 frames per second (FPS), but achieves a higher average detection accuracy of 95.25%. In contrast, the model with MediaPipe offers faster landmark detection, with an average of 20 FPS and a detection accuracy of 94.4%. Both models could thus effectively interpret the lips state for non-verbal semantics into a natural language.
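
The landmark-distance and counting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CNN: it swaps in a plain threshold on a distance ratio, and the landmark ordering, the function names lip_open_ratio and count_open_close, and the threshold of 0.2 are all assumptions made for illustration. In practice the six points would come from a detector such as DLIB or MediaPipe; here they are hand-written coordinates.

```python
from math import dist
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]


def lip_open_ratio(landmarks: Sequence[Point]) -> float:
    """Mean vertical lip gap divided by mouth width.

    Assumed (hypothetical) landmark order: left corner, right corner,
    upper-left, upper-right, lower-left, lower-right.
    """
    left, right, up_l, up_r, low_l, low_r = landmarks
    width = dist(left, right)
    if width == 0:
        raise ValueError("mouth corners coincide")
    # Average the two upper-to-lower distances to get one gap value.
    gap = (dist(up_l, low_l) + dist(up_r, low_r)) / 2
    return gap / width


def count_open_close(ratios: Iterable[float], threshold: float = 0.2) -> int:
    """Count closed -> open -> closed cycles in a per-frame ratio sequence.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    count = 0
    is_open = False
    for r in ratios:
        if r > threshold and not is_open:
            is_open = True          # lips just opened
        elif r <= threshold and is_open:
            is_open = False         # lips just closed: one full cycle
            count += 1
    return count


# Hand-written example frames (not real detector output).
closed = [(0, 0), (4, 0), (1, 0.1), (3, 0.1), (1, -0.1), (3, -0.1)]
opened = [(0, 0), (4, 0), (1, 1), (3, 1), (1, -1), (3, -1)]
frames = [closed, opened, opened, closed, opened, closed]
cycles = count_open_close(lip_open_ratio(f) for f in frames)
print(cycles)
```

Dividing the gap by the mouth width makes the ratio roughly independent of how far the face is from the camera. The resulting count could then be mapped to a symbol, which is the role the abstract assigns to the total count.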

updated: Sat Dec 11 2021 15:14:03 GMT+0000 (UTC)

published: Thu Dec 09 2021 08:16:00 GMT+0000 (UTC)

arXiv
