SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Wenxuan Zhang; Xiaodong Cun; Xuan Wang; Yong Zhang; Xi Shen; Yu Guo; Ying Shan; Fei Wang

Generating talking head videos from a face image and a piece of speech audio still poses many challenges, i.e., unnatural head movement, distorted expressions, and identity modification. We argue that these issues arise mainly from learning from coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from stiff expressions and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of the proposed face render to synthesize the final video. We conduct extensive experiments to show the superiority of our method in terms of motion and video quality.
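To make the conditional-VAE idea behind PoseVAE more concrete, the following is a minimal sketch of the sampling step only: a latent vector is drawn from a standard normal prior and decoded together with an audio feature and a style label. It is not the authors' implementation. The dimensions, the untrained linear decoder, and the names `make_decoder` and `sample_pose` are all hypothetical and chosen purely for illustration.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
POSE_DIM = 6      # assumed: rotation + translation values per frame
LATENT_DIM = 4    # assumed latent size
AUDIO_DIM = 8     # assumed per-frame audio feature size
NUM_STYLES = 3    # assumed number of style labels


def one_hot(index, size):
    """Encode a style label as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def make_decoder(rng):
    """Build a toy linear decoder with random, untrained weights (hypothetical)."""
    in_dim = LATENT_DIM + AUDIO_DIM + NUM_STYLES
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(POSE_DIM)]

    def decode(z, audio_feat, style_id):
        # The condition (audio feature + style label) is concatenated to the latent.
        x = list(z) + list(audio_feat) + one_hot(style_id, NUM_STYLES)
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return decode


def sample_pose(decode, audio_feat, style_id, rng):
    """Draw z from the N(0, I) prior and decode it under the given condition."""
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    return decode(z, audio_feat, style_id)


rng = random.Random(0)
decode = make_decoder(rng)
audio = [0.5] * AUDIO_DIM
pose_a = sample_pose(decode, audio, 0, rng)
pose_b = sample_pose(decode, audio, 1, rng)
```

With the same audio feature, changing the style label or resampling the latent yields a different pose vector, which is the property a conditional VAE relies on to produce head motion in different styles. A trained model would replace the random linear map with learned encoder and decoder networks.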

updated: Tue Nov 22 2022 11:35:07 GMT+0000 (UTC)

published: Tue Nov 22 2022 11:35:07 GMT+0000 (UTC)

arXiv
