3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Gang Li; Heliang Zheng; Chaoyue Wang; Chang Li; Changwen Zheng; Dacheng Tao

Text-guided diffusion models have shown superior performance in image and video generation and editing, but few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models with the ability to achieve 3D-consistent generation. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. These results provide 3D priors as condition information for the subsequent diffusion process. During denoising diffusion, we further enhance 3D consistency by modeling cross-view correspondences with a novel two-stream asynchronous diffusion process, in which the two streams correspond to two different views. Second, we study 3D local editing and propose a two-step solution that generates 360° manipulated results by editing an object from a single view. In step 1, we perform 2D local editing by blending the predicted noises. In step 2, we conduct a noise-to-text inversion process that maps the 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360° images can be generated. Last but not least, we extend our model to perform one-shot novel view synthesis by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications demonstrate the capabilities of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
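
The abstract does not spell out how the predicted noises are blended in step 1 of the local editing pipeline. As a rough illustration only, the sketch below assumes a common mask-based formulation: inside an edit region the noise predicted under the editing prompt is used, and outside it the noise predicted under the original prompt is kept. The function name `blend_predicted_noise`, the binary mask, and the array shapes are all hypothetical and are not taken from the paper.

```python
import numpy as np


def blend_predicted_noise(eps_original, eps_edited, mask):
    """Blend two noise predictions with a spatial mask.

    Hypothetical helper (assumption, not the authors' code):
    mask == 1 selects the edited prediction, mask == 0 keeps the original.
    """
    eps_original = np.asarray(eps_original, dtype=np.float64)
    eps_edited = np.asarray(eps_edited, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if eps_original.shape != eps_edited.shape:
        raise ValueError("noise predictions must share a shape")
    # A (H, W) mask broadcasts over the leading channel axis of (C, H, W).
    return mask * eps_edited + (1.0 - mask) * eps_original


# Toy usage: 3-channel 4x4 noise maps with a 2x2 edit region.
rng = np.random.default_rng(0)
eps_orig = rng.standard_normal((3, 4, 4))
eps_edit = rng.standard_normal((3, 4, 4))
edit_mask = np.zeros((4, 4))
edit_mask[1:3, 1:3] = 1.0
blended = blend_predicted_noise(eps_orig, eps_edit, edit_mask)
```

In a full sampler this blend would presumably be applied at every denoising step; a soft (non-binary) mask would interpolate between the two predictions at the region boundary.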

updated: Fri Nov 25 2022 13:50:00 GMT+0000 (UTC)

published: Fri Nov 25 2022 13:50:00 GMT+0000 (UTC)

arXiv
