DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis

Shuai Shen; Wenliang Zhao; Zibin Meng; Wanhua Li; Zheng Zhu; Jie Zhou; Jiwen Lu

Talking head synthesis is a promising approach for the video production industry. Recently, much effort in this research area has been devoted to improving generation quality or enhancing model generalization. However, few works address both issues simultaneously, which is essential for practical applications. To this end, we turn our attention to the emerging and powerful Latent Diffusion Models and model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the sole driving factor, we investigate the control mechanism of the talking face and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos synchronized with the source audio and, more importantly, generalizes naturally across different identities without any further fine-tuning. Additionally, DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to this demonstration: https://cloud.tsinghua.edu.cn/f/e13f5aad2f4c4f898ae7/.
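
To make the idea of a conditioned denoising process more concrete, the following is a minimal sketch of the standard DDPM-style forward noising and its inversion, with the three conditioning signals named in the abstract (audio, reference face, landmarks) fused into one vector. It is not the authors' implementation: the function names, the linear noise schedule, the fusion by concatenation, the feature shapes, and the placeholder `toy_denoiser` are all assumptions made for illustration only.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumed choice); returns betas and cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def build_condition(audio_feat, ref_latent, landmark_feat):
    """Fuse the three conditions by flattening and concatenating (hypothetical fusion)."""
    return np.concatenate(
        [audio_feat.ravel(), ref_latent.ravel(), landmark_feat.ravel()]
    )


def add_noise(z0, t, alpha_bars, rng):
    """Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return zt, eps


def predict_z0(zt, t, eps_hat, alpha_bars):
    """Recover an estimate of the clean latent from a noise estimate eps_hat."""
    return (zt - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])


def toy_denoiser(zt, t, cond):
    """Hypothetical stand-in for a trained conditional noise-prediction network."""
    return np.zeros_like(zt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    _, alpha_bars = make_schedule()

    # Illustrative shapes only: a 4x8x8 face latent, a 16-d audio feature,
    # and 68 two-dimensional landmarks.
    z0 = rng.standard_normal((4, 8, 8))
    cond = build_condition(np.ones(16), np.ones((4, 8, 8)), np.ones((68, 2)))

    zt, eps = add_noise(z0, 500, alpha_bars, rng)
    eps_hat = toy_denoiser(zt, 500, cond)  # a real model would use cond here
    z0_hat = predict_z0(zt, 500, eps_hat, alpha_bars)
    print(cond.shape, z0_hat.shape)
```

In a trained model the placeholder would be replaced by a network that predicts the noise from the noisy latent, the timestep, and the condition vector. With the true noise supplied as `eps_hat`, `predict_z0` recovers the original latent exactly, which is the property a learned denoiser approximates.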

updated: Tue Jan 10 2023 05:11:25 GMT+0000 (UTC)

published: Tue Jan 10 2023 05:11:25 GMT+0000 (UTC)

arXiv
