CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language

Aditya Sanghi; Rao Fu; Vivian Liu; Karl Willis; Hooman Shayani; Amir Hosein Khasahmadi; Srinath Sridhar; Daniel Ritchie

Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this in a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP's image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines. The code is available at https://ivl.cs.brown.edu/#/projects/clip-sculptor.
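The abstract mentions a novel variant of classifier-free guidance but does not describe it. For orientation only, the following is a minimal sketch of the standard classifier-free guidance rule applied to token logits, not the paper's variant. The function names `cfg_logits` and `softmax`, the example logits, and the guidance scales are all hypothetical and chosen for illustration.

```python
import math
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance on logits (illustrative, not the paper's variant).

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 extrapolates past the conditional ones.
    """
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must have equal length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 3-entry codebook for one latent position.
conditional = [2.0, 0.5, -1.0]
unconditional = [1.0, 1.0, 0.0]

for guidance_scale in (0.0, 1.0, 3.0):
    probs = softmax(cfg_logits(conditional, unconditional, guidance_scale))
    print(guidance_scale, [round(p, 3) for p in probs])
```

In this standard form, a larger scale concentrates probability on the tokens favored by the condition, which tends to raise accuracy at the cost of sample diversity. That is the trade-off the abstract says the proposed variant improves.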

updated: Thu Apr 13 2023 20:37:07 GMT+0000 (UTC)

published: Wed Nov 02 2022 18:50:25 GMT+0000 (UTC)

arXiv
