Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

Zhenhui Ye; Ziyue Jiang; Yi Ren; Jinglin Liu; Chen Zhang; Xiang Yin; Zejun Ma; Zhou Zhao

We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few minutes of talking-person video, with its audio track, as training data and arbitrary text as the driving input, we aim to synthesize high-quality talking portrait videos that correspond to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved, owing to two challenges: (1) it is difficult for a traditional multi-speaker Text-to-Speech (TTS) system to mimic the timbre of out-of-domain audio; and (2) it is hard to render high-fidelity, lip-synchronized talking avatars from limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the two challenges above and generates identity-preserving speech and realistic talking-person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.

updated: Tue Jun 06 2023 08:50:13 GMT+0000 (UTC)

published: Tue Jun 06 2023 08:50:13 GMT+0000 (UTC)

arXiv
