Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Pingchuan Ma; Alexandros Haliassos; Adriana Fernandez-Lopez; Honglie Chen; Stavros Petridis; Maja Pantic

Audio-visual speech recognition has attracted considerable attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has improved substantially, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR, and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, reduces the word error rate (WER) even though the added transcriptions are noisy. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
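
The abstract reports its results as a WER and as a relative improvement. The following is a minimal sketch of how those two quantities are commonly computed, not the authors' evaluation code. It assumes the standard definition of WER as word-level edit distance divided by the number of reference words, with whitespace tokenization and no text normalization. The function names (word_error_rate, relative_improvement, implied_baseline) are hypothetical and chosen for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, improvement: float) -> float:
    """Baseline WER implied by a new WER and a relative improvement."""
    return new_wer / (1.0 - improvement)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

    # Baseline implied by the abstract's figures (0.9% WER, 30% relative
    # improvement), treating both rounded numbers as exact.
    print(implied_baseline(0.9, 0.30))
```

Taking the abstract's rounded figures at face value, a WER of 0.9% with a 30% relative improvement implies a previous best of roughly 1.3%. This is only an arithmetic consequence of the two stated numbers, not a value reported in the abstract.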

updated: Wed Jun 28 2023 14:41:17 GMT+0000 (UTC)

published: Sat Mar 25 2023 00:37:34 GMT+0000 (UTC)

arXiv
