Image-to-Video Generation via 3D Facial Dynamics

Xiaoguang Tu; Yingtian Zou; Jian Zhao; Wenjie Ai; Jian Dong; Yuan Yao; Zhikang Wang; Guodong Guo; Zhifeng Li; Wei Liu; Jiashi Feng

We present FaceAnime, a versatile model for various video generation tasks from still images. Video generation from a single face image is an interesting problem that is usually tackled by using Generative Adversarial Networks (GANs) to integrate information from the input face image and a sequence of sparse facial landmarks. However, the generated face images usually suffer from quality loss, image distortion, identity change, and expression mismatch because of the weak representation capacity of facial landmarks. In this paper, we propose to "imagine" a face video from a single face image according to the reconstructed 3D face dynamics, aiming to generate a realistic and identity-preserving face video with precisely predicted pose and facial expression. The 3D dynamics reveal changes in facial expression and motion, and can serve as strong prior knowledge for guiding highly realistic face video generation. In particular, we explore face video prediction and exploit a well-designed 3D dynamic prediction network to predict a 3D dynamic sequence for a single face image. The 3D dynamics are then rendered by a sparse texture mapping algorithm to recover structural details and sparse textures for generating face frames. Our model is versatile for various AR/VR and entertainment applications, such as face video retargeting and face video prediction. Experimental results demonstrate its effectiveness in generating high-fidelity, identity-preserving, and visually pleasant face video clips from a single source face image.
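
The abstract does not spell out the sparse texture mapping step, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the 3D vertices are already aligned to image coordinates, that a larger z means closer to the camera, and that each vertex carries a colour sampled from the source face. The function name `sparse_texture_map` and the `step` subsampling parameter are hypothetical.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]  # (x, y, z), assumed already in image coordinates
Color = Tuple[int, int, int]         # (r, g, b)


def sparse_texture_map(
    vertices: List[Vertex],
    colors: List[Color],
    height: int,
    width: int,
    step: int = 1,
) -> List[List[Color]]:
    """Scatter a subsample of coloured 3D vertices onto a blank image plane.

    Hypothetical helper: orthographic projection (drop z), with a simple
    depth test so the vertex closest to the camera wins at each pixel.
    """
    image = [[(0, 0, 0)] * width for _ in range(height)]
    depth = [[float("-inf")] * width for _ in range(height)]
    # Keep every `step`-th vertex so the resulting map stays sparse.
    for (x, y, z), color in list(zip(vertices, colors))[::step]:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        # Assumption: larger z is closer to the camera.
        if inside and z > depth[row][col]:
            depth[row][col] = z
            image[row][col] = color
    return image
```

In a pipeline of the kind the abstract describes, a map like this would be produced for each predicted 3D frame and passed to a generator that fills in dense appearance; that generator is outside the scope of this sketch.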

updated: Mon May 31 2021 02:30:11 GMT+0000 (UTC)

published: Mon May 31 2021 02:30:11 GMT+0000 (UTC)

arXiv
