One-shot Implicit Animatable Avatars with Model-based Priors

Yangyi Huang; Hongwei Yi; Weiyang Liu; Haofan Wang; Boxi Wu; Wenxiao Wang; Binbin Lin; Debing Zhang; Deng Cai

Existing neural rendering methods for creating human avatars typically either require dense input signals, such as video or multi-view images, or leverage a prior learned from large-scale 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can effortlessly estimate body geometry and imagine full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT takes the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors jointly guide the optimization toward plausible content in the invisible areas. By taking advantage of the CLIP models, ELICIT can also use text descriptions to generate text-conditioned unseen regions. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms strong baseline methods for avatar creation when only a single image is available. The code is publicly available for research purposes at https://elicit3d.github.io/
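
As a rough illustration of how a visual semantic prior of this kind can act as an optimization signal, the sketch below scores a rendered view against the input image by the cosine distance between two embedding vectors. It is a minimal sketch under stated assumptions, not the loss used by ELICIT: the abstract does not give the exact formulation, the function names `cosine_similarity` and `semantic_loss` are hypothetical, and the embeddings are plain lists of floats standing in for the output of a CLIP-style image encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def semantic_loss(rendered_emb: Sequence[float],
                  reference_emb: Sequence[float]) -> float:
    """Hypothetical semantic loss (an assumption, not the paper's definition).

    Returns 0 when the two embeddings point in the same direction and
    grows toward 2 as they point in opposite directions.
    """
    return 1.0 - cosine_similarity(rendered_emb, reference_emb)
```

In a full pipeline, a term like this would be evaluated on renderings from unseen viewpoints and combined with a reconstruction term on the visible view; how the terms are weighted is left open here because the abstract does not specify it.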

updated: Thu Mar 16 2023 09:59:52 GMT+0000 (UTC)

published: Mon Dec 05 2022 18:24:06 GMT+0000 (UTC)

arXiv
