FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Yongqi Wang; Zhou Zhao

Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces, with no restriction on head pose or vocabulary. Current work mainly solves this problem with sequence-to-sequence models, using either an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of generating audio directly, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from them. This makes deployment cumbersome and degrades speech quality through error propagation; 2) The audio reconstruction algorithm used by these models limits inference speed and audio quality, while neural vocoders cannot be used with these models because their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither is efficient enough in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and has a relatively small model size. In addition, unlike the widely used 3D-CNN visual frontend for lip movement encoding, we propose, for the first time, a transformer-based visual frontend for this task. Experiments show that our model achieves a 19.76× speedup in audio waveform generation over the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.
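
To make the latency argument concrete, the following sketch counts how many decoder passes must run one after another for a 3-second clip. It is only an illustration: the frame rate (25 fps), sample rate (16 kHz), and hop length (200 samples) are assumed values that the abstract does not state, and the function names are hypothetical rather than taken from FastLTS.

```python
def sequence_lengths(duration_s, fps=25, sample_rate=16000, hop_length=200):
    """Return (video_frames, mel_frames, waveform_samples) for a clip.

    fps, sample_rate and hop_length are assumed defaults for illustration,
    not values reported in the abstract.
    """
    video_frames = round(duration_s * fps)
    waveform_samples = round(duration_s * sample_rate)
    mel_frames = waveform_samples // hop_length
    return video_frames, mel_frames, waveform_samples


def sequential_steps(mel_frames, autoregressive, frames_per_step=1):
    """Number of decoder passes that cannot be run in parallel."""
    if autoregressive:
        # Each pass depends on the previous output, so passes are serial.
        return -(-mel_frames // frames_per_step)  # ceiling division
    # A non-autoregressive decoder emits all frames in one parallel pass.
    return 1


video, mel, wav = sequence_lengths(3.0)
print(video, mel, wav)  # 75 240 48000
print(sequential_steps(mel, autoregressive=True))   # 240
print(sequential_steps(mel, autoregressive=False))  # 1
```

The step count is not the same thing as the measured speedup: wall-clock gains such as the reported 19.76× also depend on model size, hardware, and the cost of each pass.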

updated: Fri Jul 08 2022 10:10:39 GMT+0000 (UTC)

published: Fri Jul 08 2022 10:10:39 GMT+0000 (UTC)

arXiv
