Sub-word Level Lip Reading With Visual Attention

K R Prajwal; Triantafyllos Afouras; Andrew Zisserman

The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
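The headline figure above is a word error rate (WER). As a minimal sketch of how that metric is usually defined, the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a hypothesis is divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this is not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; assumes whitespace tokenisation
    and case-sensitive matching.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 22.6% means roughly one word-level edit for every four to five reference words. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.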

Updated: 3 December 2021, 11:35:51 UTC

Published: 14 October 2021, 17:59:57 UTC

arXiv
