DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Fa-Ting Hong; Li Shen; Dan Xu

Predominant techniques for talking head generation depend largely on 2D information, including facial appearance and motion from input face images. However, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and in suppressing complex background noise during generation. Dense 3D annotations for facial videos are nevertheless prohibitively costly to obtain. In this work, we first present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations in training. We further propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. Second, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer, to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that the proposed framework generates highly realistic reenacted talking videos and establishes new state-of-the-art performance on these benchmarks. The code and trained models are publicly available on the GitHub project page at https://github.com/harlanhong/CVPR2022-DaGAN
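
The abstract does not give the form of the pixel-level uncertainty objective, so the following is only a minimal sketch of one common way such a loss is written: an L1 photometric residual divided by a learned per-pixel scale, plus a log-scale penalty that stops the network from declaring every pixel uncertain. The function name `uncertainty_weighted_l1`, the flat-list inputs, and the toy values are all hypothetical and are not taken from the paper or its repository.

```python
import math


def uncertainty_weighted_l1(pred, target, log_sigma):
    """Mean of |pred - target| * exp(-log_sigma) + log_sigma over all pixels.

    pred, target, log_sigma are equally sized flat lists of floats.
    log_sigma stands in for a per-pixel log-uncertainty map that a network
    would predict (an assumption made for this sketch).
    """
    if not (len(pred) == len(target) == len(log_sigma)):
        raise ValueError("inputs must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for p, t, s in zip(pred, target, log_sigma):
        residual = abs(p - t)
        # A large log_sigma shrinks the residual term but pays a penalty of s.
        total += residual * math.exp(-s) + s
    return total / len(pred)


# Toy example in arbitrary intensity units: the third pixel has a large
# residual, as might happen where the motion is not rigid.
target = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 7.0]

# Every pixel trusted equally (log_sigma = 0 everywhere).
uniform = uncertainty_weighted_l1(pred, target, [0.0, 0.0, 0.0])
# The unreliable pixel is marked as uncertain (log_sigma = 1).
downweighted = uncertainty_weighted_l1(pred, target, [0.0, 0.0, 1.0])
```

In this toy case `downweighted` is lower than `uniform`, which illustrates how a learned uncertainty map can reduce the influence of pixels that violate the rigid-motion assumption while the log term keeps the uncertainty from growing without bound.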

updated: Wed May 10 2023 14:58:33 GMT+0000 (UTC)

published: Wed May 10 2023 14:58:33 GMT+0000 (UTC)

arXiv
