3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

Zutao Jiang; Guansong Lu; Xiaodan Liang; Jihua Zhu; Wei Zhang; Xiaojun Chang; Hang Xu

Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which offers a flexible way to visualize what we imagine. Although some works have been devoted to solving this challenging task, they either utilize explicit 3D representations (e.g., mesh), which lack texture and require post-processing to render photo-realistic views, or require time-consuming optimization for every single case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance and view contrastive learning are proposed to achieve better view-consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain the implicit 3D neural representation from the previously generated views. Our 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture and requires no time-consuming optimization for each caption. Besides, 3D-TOGO can control the category, color and shape of generated 3D objects with the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO generates higher-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
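
PSNR is one of the metrics named in the abstract. As a rough illustration only, the following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE), and not the authors' evaluation code. The function name `psnr`, the flat-list image representation, and the default peak value of 255 are assumptions made for this example.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Return PSNR in dB between two images given as flat sequences of pixel values.

    Hypothetical helper for illustration; the paper does not specify
    how its PSNR figures were computed.
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the rendered view is closer to the reference view; identical images give an infinite score under this definition.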

updated: Wed Aug 16 2023 07:12:41 GMT+0000 (UTC)

published: Fri Dec 02 2022 11:31:49 GMT+0000 (UTC)

arXiv
