SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Xubo Liu; Egor Lakomkin; Konstantinos Vougioukas; Pingchuan Ma; Honglie Chen; Ruiming Xie; Morrie Doulaty; Niko Moritz; Jáchym Kolář; Stavros Petridis; Maja Pantic; Christian Fuegen

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3). SynthVSR achieves a word error rate (WER) of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches using thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.
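All results above are reported as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code, and real evaluation pipelines typically normalize text (casing, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 16.9% corresponds to roughly one word error for every six reference words. Because insertions are counted, WER can exceed 100% for very poor hypotheses.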

updated: Thu Mar 30 2023 07:43:27 GMT+0000 (UTC)

published: Thu Mar 30 2023 07:43:27 GMT+0000 (UTC)

arXiv
