One-shot Implicit Animatable Avatars with Model-based Priors

Yangyi Huang; Hongwei Yi; Weiyang Liu; Haofan Wang; Boxi Wu; Wenxiao Wang; Binbin Lin; Debing Zhang; Deng Cai

Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a prior learned from large-scale 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can effortlessly estimate body geometry and imagine full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT takes the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pretrained models. Both priors jointly guide the optimization toward plausible content in the invisible areas. Taking advantage of the CLIP models, ELICIT can also use text descriptions to generate text-conditioned unseen regions. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms strong baseline methods of avatar creation when only a single image is available. The code is publicly available for research purposes at https://huangyangyi.github.io/ELICIT/.
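
To make the idea of a visual semantic prior concrete, the following is a minimal sketch of how an embedding-based consistency term could be combined with a reconstruction term on the visible view. The abstract does not specify the actual loss, so everything here is an assumption for illustration: the function names, the cosine-distance form, the mean-squared reconstruction error, and the `semantic_weight` value are hypothetical, and the embeddings are plain vectors standing in for the output of a CLIP-style image encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_loss(rendered_embedding, reference_embedding):
    """Hypothetical semantic term: 1 - cosine similarity.

    It is 0 when the embedding of a rendered novel view points in the
    same direction as the embedding of the input image, and grows as
    the two diverge.
    """
    return 1.0 - cosine_similarity(rendered_embedding, reference_embedding)


def reconstruction_loss(rendered_pixels, target_pixels):
    """Mean squared error on the pixels visible in the input image."""
    rendered = np.asarray(rendered_pixels, dtype=float)
    target = np.asarray(target_pixels, dtype=float)
    return float(np.mean((rendered - target) ** 2))


def total_loss(rendered_pixels, target_pixels,
               rendered_embedding, reference_embedding,
               semantic_weight=0.5):
    """Weighted sum of both terms; the weight is an illustrative choice."""
    return (reconstruction_loss(rendered_pixels, target_pixels)
            + semantic_weight * semantic_loss(rendered_embedding,
                                              reference_embedding))
```

In this sketch the reconstruction term constrains only what the input view shows, while the semantic term is the part that can supervise renderings from unseen viewpoints, since it compares embeddings rather than pixels.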

updated: Mon Aug 21 2023 08:59:06 GMT+0000 (UTC)

published: Mon Dec 05 2022 18:24:06 GMT+0000 (UTC)

arXiv
