DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

Chenpeng Du; Qi Chen; Xie Chen; Kai Yu

While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
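The decoding stage described above can be illustrated with a minimal sketch. It assumes the standard deterministic DDIM update (no added noise per step) and conditions each frame on its own latent vector. Everything named here is hypothetical and not taken from the paper: `toy_eps_model` stands in for a trained conditional noise predictor, the noise schedule and 8-dimensional "frames" are toy choices, and reusing one starting noise `x_T` for all frames is an assumption of this sketch about how per-frame decoding could stay temporally consistent.

```python
import numpy as np


def make_alpha_bars(num_steps: int) -> np.ndarray:
    """Cumulative products of (1 - beta_t) for a toy linear beta schedule."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(x_t: np.ndarray, t: int, z: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained noise predictor conditioned on latent z."""
    return np.tanh(x_t - z) * (1.0 + 0.01 * t)


def ddim_decode(z, x_T, alpha_bars, eps_model=toy_eps_model):
    """Deterministic DDIM sampling: the output depends only on z and x_T."""
    x = x_T.copy()
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = eps_model(x, t, z)
        # Estimate the clean sample, then step to the previous noise level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x


def decode_frames(latents, x_T, alpha_bars):
    """Decode each frame independently, sharing the same starting noise x_T."""
    return np.stack([ddim_decode(z, x_T, alpha_bars) for z in latents])


# Toy usage: three per-frame latents (as a speech2latent model might predict).
rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(50)
x_T = rng.standard_normal(8)
latents = rng.standard_normal((3, 8))
frames = decode_frames(latents, x_T, alpha_bars)
```

Because the update is deterministic, the only things that vary between frames in this sketch are the latents, so a smoothly varying latent sequence gives frames that differ only through that conditioning.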

updated: Fri Mar 01 2024 11:43:46 GMT+0000 (UTC)

published: Thu Mar 30 2023 17:18:31 GMT+0000 (UTC)

arXiv
