Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Se-Yun Um; Jihyun Kim; Jihyun Lee; Hong-Goo Kang

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speakers. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. Objective and subjective evaluations show that the proposed model outperforms conventional methods. Specifically, we evaluate the quality of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). Demo samples of the proposed and other models are available at https://sam-0927.github.io/
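
The independent control of the two auxiliary conditions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name build_condition, the toy feature sizes, and the choice of simple concatenation to combine per-frame linguistic features with one utterance-level speaker vector. The abstract does not state how the model actually combines them.

```python
from typing import List

Vector = List[float]


def build_condition(linguistic_frames: List[Vector], speaker_embedding: Vector) -> List[Vector]:
    """Append one utterance-level speaker vector to every per-frame linguistic vector.

    Hypothetical helper: concatenation is an assumed conditioning scheme,
    not the authors' documented method.
    """
    return [frame + speaker_embedding for frame in linguistic_frames]


# Toy values: 3 video frames with 2-dimensional linguistic features (assumed sizes).
linguistic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Two assumed speaker embeddings, as if predicted from two different faces.
speaker_a = [1.0, 0.0, 0.0]
speaker_b = [0.0, 1.0, 0.0]

cond_a = build_condition(linguistic, speaker_a)
cond_b = build_condition(linguistic, speaker_b)

# The linguistic part of the condition is identical for both speakers;
# only the speaker part changes.
same_content = [c[:2] for c in cond_a] == [c[:2] for c in cond_b]
```

Under these assumptions, swapping the speaker vector leaves the linguistic part of the condition unchanged, which is the property that would let a generator keep the spoken content fixed while varying the voice.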

updated: Wed Mar 15 2023 12:28:22 GMT+0000 (UTC)

published: Mon Jul 26 2021 07:36:02 GMT+0000 (UTC)

arXiv
