3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion

Yu-Jhe Li; Kris Kitani

We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given an input latent code). Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content using NeRFs and text prompts, but the current approach of optimizing a NeRF for every text prompt 1) is extremely time-consuming and 2) often leads to low-resolution outputs. To address these challenges, we propose a novel method named 3D-CLFusion, which leverages pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network that learns the w latent from input CLIP text/image embeddings. This pipeline allows us to produce the w latent without further optimization during inference, and the pre-trained NeRF can then perform multi-view, high-resolution 3D synthesis based on that latent. The novelty of our model lies in introducing contrastive learning while training the diffusion prior, which enables the generation of valid view-invariant latent codes. We demonstrate through experiments the effectiveness of our proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion. Our model can serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs.
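The abstract does not state the exact form of the contrastive objective, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style loss: latents predicted from two views of the same object are treated as a positive pair, and latents of other objects in the batch as negatives. All names here (`info_nce`, `make_batch`, the temperature value, the toy Gaussian latents) are hypothetical and are not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(latents_a, latents_b, temperature=0.1):
    """Assumed InfoNCE loss: latents_a[i] and latents_b[i] come from two
    views of object i (positive pair); every other pairing is a negative."""
    a = [normalize(v) for v in latents_a]
    b = [normalize(v) for v in latents_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def make_batch(num_objects=4, dim=8, view_noise=0.05, seed=0):
    """Toy stand-in for predicted w latents: one base vector per object,
    plus small per-view noise (hypothetical data, not from the paper)."""
    rng = random.Random(seed)
    base = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_objects)]

    def jitter(v):
        return [x + rng.gauss(0, view_noise) for x in v]

    return [jitter(v) for v in base], [jitter(v) for v in base]


view_a, view_b = make_batch()
# Views of the same object are paired: the loss should be small.
aligned_loss = info_nce(view_a, view_b)
# Pairing each latent with a different object's view: the loss should be larger.
shuffled_loss = info_nce(view_a, view_b[1:] + view_b[:1])
```

In this toy setup the loss is low when latents from different views of the same object agree and high when they are mismatched, which is the property a view-invariant latent code needs. How such a term is combined with the diffusion prior's training loss is not described in the abstract and is left out here.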

updated: Tue Mar 21 2023 15:38:26 GMT+0000 (UTC)

published: Tue Mar 21 2023 15:38:26 GMT+0000 (UTC)

arXiv
