3D generation on ImageNet

Ivan Skorokhodov; Aliaksandr Siarohin; Yinghao Xu; Jian Ren; Hsin-Ying Lee; Peter Wonka; Sergey Tulyakov

Existing 3D-from-2D generators are typically designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location, and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, like ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module that handles the imprecision. Second, we create a flexible camera model, together with a regularization strategy, whose distribution parameters are learned during training. Finally, we extend recent ideas of transferring knowledge from pre-trained classifiers into GANs to patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. This technique achieves more stable training than the existing methods and speeds up convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256x256, SDIP Elephants 256x256, LSUN Horses 256x256, and ImageNet 256x256, and demonstrate that 3DGP outperforms the recent state-of-the-art in terms of both texture and geometry quality. Code and visualizations: https://snap-research.github.io/3dgp.
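
As a rough illustration of the distillation idea mentioned in the abstract, the sketch below computes a feature-matching loss between features predicted by a discriminator and features from a frozen pre-trained classifier. The abstract does not specify the loss, so the squared L2 distance, the function name `distillation_loss`, and the array shapes are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def distillation_loss(disc_features, teacher_features):
    """Mean (over the batch) squared L2 distance between two feature sets.

    Both inputs have shape (batch, feature_dim). Squared L2 is an assumed
    choice; the abstract only says "a simple distillation-based technique".
    """
    disc_features = np.asarray(disc_features, dtype=np.float64)
    teacher_features = np.asarray(teacher_features, dtype=np.float64)
    if disc_features.shape != teacher_features.shape:
        raise ValueError("feature shapes must match")
    per_sample = np.sum((disc_features - teacher_features) ** 2, axis=1)
    return float(np.mean(per_sample))


# Hypothetical stand-ins: 'teacher' plays the role of frozen classifier
# features, 'pred' the features predicted by the discriminator.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
pred = teacher + 0.1  # every coordinate is off by 0.1
loss = distillation_loss(pred, teacher)
```

In a GAN setup, a term like this would typically be added to the discriminator's adversarial loss with some weighting; that weighting is likewise not given in the abstract.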

updated: Thu Mar 02 2023 17:06:57 GMT+0000 (UTC)

published: Thu Mar 02 2023 17:06:57 GMT+0000 (UTC)

arXiv
