Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Suzhen Wang; Lincheng Li; Yu Ding; Changjie Fan; Xin Yu

We propose an audio-driven talking-head method that generates photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the speaker's appearance under large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). The predicted head poses then act as the low-frequency holistic movements of a talking head, allowing the subsequent network to focus on generating detailed facial movements. To depict the entire image motion arising from audio, we exploit a keypoint-based dense motion field representation. We then develop a motion field generator that produces the dense motion fields from the input audio, the head poses, and a reference image. Because this keypoint-based representation models the motions of the facial regions, the head, and the background jointly, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network renders photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms state-of-the-art methods.
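The abstract gives no formulas, so the following is only a toy illustration of what a keypoint-based dense motion field could look like, not the authors' implementation. It assumes a simple scheme in which each target pixel blends per-keypoint translations with Gaussian weights, plus an identity "background" candidate. The function name `dense_motion_field` and the parameters `sigma` and `bg_weight` are hypothetical and do not come from the paper.

```python
import math


def dense_motion_field(ref_kps, tgt_kps, height, width, sigma=0.1, bg_weight=0.01):
    """Toy keypoint-driven dense motion field (illustrative assumption only).

    ref_kps, tgt_kps: lists of (x, y) keypoints in normalized [-1, 1] coordinates
    for the reference image and the frame to be generated.
    Returns a height x width grid; entry [i][j] is the (x, y) location in the
    reference image that target pixel (i, j) should be sampled from.
    Requires height > 1 and width > 1.
    """
    field = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1)
            # The first candidate is the background: no motion, fixed small weight.
            weights = [bg_weight]
            shifts = [(0.0, 0.0)]
            for (rx, ry), (tx, ty) in zip(ref_kps, tgt_kps):
                # Gaussian weight: pixels near a target keypoint follow its motion.
                d2 = (x - tx) ** 2 + (y - ty) ** 2
                weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
                shifts.append((rx - tx, ry - ty))
            total = sum(weights)
            dx = sum(w * s[0] for w, s in zip(weights, shifts)) / total
            dy = sum(w * s[1] for w, s in zip(weights, shifts)) / total
            row.append((x + dx, y + dy))
        field.append(row)
    return field


if __name__ == "__main__":
    # One keypoint that sits at (0.2, 0.0) in the reference and at (0.0, 0.0)
    # in the target frame, on a 5 x 5 grid.
    field = dense_motion_field([(0.2, 0.0)], [(0.0, 0.0)], height=5, width=5)
    print(field[2][2])  # centre pixel: sampled from close to the reference keypoint
    print(field[0][0])  # corner pixel: essentially unmoved (background)
```

In this sketch, pixels far from every keypoint fall back to the identity mapping, which mirrors the abstract's point that a single representation covering face, head, and background can help keep the background stable.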

updated: Tue Jul 20 2021 07:22:42 GMT+0000 (UTC)

published: Tue Jul 20 2021 07:22:42 GMT+0000 (UTC)

arXiv
