AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion

Shuo Huang; Zongxin Yang; Liangting Li; Yi Yang; Jia Jia

Large-scale pre-trained vision-language models enable zero-shot, text-based generation of 3D avatars. The previous state-of-the-art method used CLIP to supervise neural implicit models that reconstruct a human body mesh. However, this approach has two limitations. First, the lack of avatar-specific models can cause facial distortion and unrealistic clothing in the generated avatars. Second, CLIP provides an optimization direction only for the overall appearance, which leads to less impressive results. To address these limitations, we propose AvatarFusion, the first framework to use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars while simultaneously segmenting clothing from the avatar's body. AvatarFusion includes the first clothing-decoupled neural implicit avatar model, which employs a novel Dual Volume Rendering strategy to render the decoupled skin and clothing sub-models in one space. We also introduce a novel optimization method, Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes and generates a variety of clothing styles. Moreover, we establish the first benchmark for zero-shot text-to-avatar generation. Our experimental results show that our framework outperforms previous approaches, with significant improvements on all metrics. Additionally, because our model is clothing-decoupled, the clothes of avatars can be exchanged. Code will be available on GitHub.

updated: Thu Jul 13 2023 02:19:56 GMT+0000 (UTC)

published: Thu Jul 13 2023 02:19:56 GMT+0000 (UTC)

arXiv
