Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Dmitriy Serdyuk; Otavio Braga; Olivier Siohan

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, relative improvements of 10% and 9%, respectively, over our convolutional baseline. After fine-tuning our model, we achieve state-of-the-art performance for audio-visual recognition on LRS3-TED (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative WER reduction of 2% over our convolutional video front-end.
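The abstract reports its gains as relative improvements, which are easy to confuse with absolute differences in WER points. The minimal Python sketch below shows how a relative WER reduction is conventionally computed. The function name `relative_wer_reduction` and the numbers in the usage lines are hypothetical, chosen only for illustration; they are not taken from the paper, and the abstract does not state the baseline WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER.

    Both arguments are WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not reported in the abstract).
hypothetical_baseline = 40.0  # baseline WER, in percent
hypothetical_new = 36.0       # improved WER, in percent

absolute = hypothetical_baseline - hypothetical_new
relative = relative_wer_reduction(hypothetical_baseline, hypothetical_new)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
```

Under this definition, a 10% relative improvement means the new WER is about 0.9 times the baseline WER, which is a much smaller change than a 10-point absolute drop.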

updated: Wed Sep 14 2022 14:39:24 GMT+0000 (UTC)

published: Tue Jan 25 2022 16:35:54 GMT+0000 (UTC)

arXiv
