Explicitly Controllable 3D-Aware Portrait Generation

Junshu Tang; Bo Zhang; Binxin Yang; Ting Zhang; Dong Chen; Lizhuang Ma; Fang Wen

In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches learn the data distribution directly from photographs, and the state of the art can now yield highly photo-realistic images. While many works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable according to semantic parameters for pose, identity, expression and lighting. The generative network uses a neural scene representation to model portraits in 3D, and its generation is guided by a parametric face model that supports explicit control. While latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency still exists in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented using the jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expressions in natural lighting when viewed from free viewpoints. The proposed method also generalizes to real images as well as out-of-domain cartoon faces, showing great promise for real applications. Additional video results and code will be available on the project webpage.
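The volume blending idea mentioned above can be illustrated with a minimal NumPy sketch. The abstract does not give the blending formula, so everything here is an assumption made for illustration: the function names `blend_fields` and `volume_render` are hypothetical, the blend is taken to be a per-sample convex combination weighted by a soft face mask (`face_prob`, standing in for the output of the semantic field), and the rendering step is the standard emission-absorption quadrature used by NeRF-style models rather than anything confirmed by the paper.

```python
import numpy as np


def blend_fields(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, face_prob):
    """Blend two radiance fields sample by sample (illustrative assumption).

    sigma_*   : (R, S) densities for R rays with S samples each
    rgb_*     : (R, S, 3) colors
    face_prob : (R, S) soft mask in [0, 1]; 1 means "dynamic (face) region"
    """
    mask = face_prob[..., None]  # broadcast the mask over the color channels
    sigma = face_prob * sigma_dyn + (1.0 - face_prob) * sigma_sta
    rgb = mask * rgb_dyn + (1.0 - mask) * rgb_sta
    return sigma, rgb


def volume_render(sigma, rgb, deltas):
    """Standard emission-absorption quadrature along each ray.

    deltas : (R, S) distances between consecutive samples
    Returns per-ray colors (R, 3) and per-sample weights (R, S).
    """
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance before sample i is the product of (1 - alpha) over samples < i.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights = alpha * trans
    pixels = (weights[..., None] * rgb).sum(axis=-2)
    return pixels, weights
```

In this sketch, a mask of all ones reproduces the dynamic field and a mask of all zeros reproduces the static one, so an expression change in the dynamic field would leave samples assigned to hair and background untouched. The actual method may blend at a different stage or with a different rule.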

updated: Mon Sep 12 2022 17:40:08 GMT+0000 (UTC)

published: Mon Sep 12 2022 17:40:08 GMT+0000 (UTC)

arXiv
