Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

Hang Chen; Jun Du; Yu Hu; Li-Rong Dai; Chin-Hui Lee; Bao-Cai Yin

In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. First, we introduce multi-scale processing into spatial feature extraction for lip-reading. Specifically, we propose hierarchical pyramidal convolution (HPConv) to replace the standard convolution in the original module, which improves the model's ability to discover fine-grained lip movements. Second, we use self-attention to merge information across all time steps of the sequence, so that the model pays more attention to the relevant frames. Combining these two advantages further enhances the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that the proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state of the art. We also conducted extensive experiments to better understand the behavior of the proposed model.
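To make the temporal step concrete, the following is a minimal sketch of generic scaled dot-product self-attention over a sequence of per-frame feature vectors. It is not the authors' implementation: the abstract does not describe the attention design, so this sketch assumes the raw frame features serve as queries, keys, and values, with no learned projections. The names `softmax`, `self_attention`, and `frames` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over T frame feature vectors.

    Assumption: queries, keys, and values are all the raw features
    (no learned projection matrices), which a real model would add.
    Returns (outputs, weights), where weights[i][t] is how much
    time step i attends to time step t.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        w = softmax(scores)
        # Weighted sum of all frames: information merged across time steps.
        out = [sum(w[t] * frames[t][j] for t in range(len(frames)))
               for j in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy sequence of three 2-dimensional frame features (made-up values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(frames)
```

Each output vector is a weighted average of all frames, so a frame that resembles the query frame contributes more to the merged representation than a dissimilar one.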

updated: Mon Dec 28 2020 16:55:51 GMT+0000 (UTC)

published: Mon Dec 28 2020 16:55:51 GMT+0000 (UTC)

arXiv
