Conformers are All You Need for Visual Speech Recognition

Oscar Chang; Hank Liao; Dmitriy Serdyuk; Ankit Shah; Olivier Siohan

Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, a visual front-end with a limited temporal receptive field processes the raw pixels depicting the lips or faces. At the higher level, an encoder attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved word error rate (WER). We achieve a new state of the art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
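
For readers unfamiliar with the metric quoted in the abstract, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the example sentences are hypothetical, and real evaluations typically normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); whitespace tokenization is an
    assumption, not something specified by the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a 12.8% WER corresponds to roughly one word-level error for every eight reference words.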

updated: Fri Feb 17 2023 01:31:55 GMT+0000 (UTC)

published: Fri Feb 17 2023 01:31:55 GMT+0000 (UTC)

arXiv
