Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Se-Yun Um; Jihyun Kim; Jihyun Lee; Sangshin Oh; Kyungguen Byun; Hong-Goo Kang

In this paper, we propose a method for synthesizing speaker-specific speech waveforms conditioned on video of an individual's face. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Because these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary with the input face images. Our method can therefore be regarded as a multi-speaker face-to-speech waveform model. We show the superiority of the proposed model over conventional methods in both objective and subjective evaluations. Specifically, we evaluate the performance of the linguistic feature and speaker characteristic generation modules by measuring the accuracy of automatic speech recognition and automatic speaker/gender recognition tasks, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test.
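The abstract does not say how the two feature streams are combined before they reach the GAN. The sketch below illustrates one common conditioning scheme, given here purely as an assumption and not as the authors' method: a single utterance-level speaker vector is repeated across time and appended to each frame-level linguistic vector. The function name `build_condition`, the toy feature values, and the dimensions are all hypothetical.

```python
from typing import List


def build_condition(linguistic: List[List[float]],
                    speaker: List[float]) -> List[List[float]]:
    """Append one utterance-level speaker vector to every frame-level
    linguistic vector (broadcast-and-concatenate conditioning).

    Assumption: this is an illustrative scheme, not the paper's design.
    """
    if not linguistic:
        raise ValueError("need at least one linguistic frame")
    # List concatenation creates a new list per frame, so inputs are not mutated.
    return [frame + speaker for frame in linguistic]


# Hypothetical toy inputs: 3 video frames with 2-dim lip-reading features,
# and two different 2-dim speaker vectors predicted from two faces.
lip_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]

# Same linguistic content, different speaker identity.
cond_a = build_condition(lip_features, speaker_a)
cond_b = build_condition(lip_features, speaker_b)
```

Because the linguistic part of `cond_a` and `cond_b` is identical and only the trailing speaker part differs, a generator conditioned this way could in principle keep the spoken content fixed while changing the voice, which is the kind of independent control the abstract describes.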

updated: Mon Jul 26 2021 07:36:02 GMT+0000 (UTC)

published: Mon Jul 26 2021 07:36:02 GMT+0000 (UTC)

arXiv
