Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Bowen Shi; Wei-Ning Hsu; Kushal Lakhotia; Abdelrahman Mohamed

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns a powerful audio-visual speech representation that benefits both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark, LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%), which was trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when all 433 hours of labeled data from LRS3 are used in combination with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
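
The comparisons quoted in the abstract can be checked with simple arithmetic. The snippet below is only an illustrative sketch and is not part of the AV-HuBERT codebase: the helper `relative_reduction` and all variable names are hypothetical, and the inputs are the rounded figures from the abstract, so the results only approximate the numbers the authors report.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline` (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Rounded figures quoted in the abstract (LRS3 benchmark).
audio_wer_prev, audio_wer_new = 2.3, 1.3   # audio-only WER in percent
lip_wer_prev, lip_wer_new = 33.6, 32.5     # lip-reading WER in percent
hours_prev, hours_new = 31_000, 30         # labeled hours: prior work vs AV-HuBERT

audio_rel = relative_reduction(audio_wer_prev, audio_wer_new)
lip_rel = relative_reduction(lip_wer_prev, lip_wer_new)
data_ratio = hours_prev / hours_new

print(f"audio-only relative WER reduction: {audio_rel:.1%}")
print(f"lip-reading relative WER reduction: {lip_rel:.1%}")
print(f"labeled-data ratio: {data_ratio:.0f}x")
```

With these rounded inputs the audio-only reduction comes out slightly above the 40% stated in the abstract, and the labeled-data ratio is a little over one thousand, in line with the "thousand times more" claim.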

updated: Wed Jan 05 2022 17:40:45 GMT+0000 (UTC)

published: Wed Jan 05 2022 17:40:45 GMT+0000 (UTC)

arXiv
