SceneScape: Text-Driven Consistent Scene Generation

Rafail Fridman; Amit Abecasis; Yoni Kasten; Tali Dekel

We propose a method for text-driven perpetual view generation: synthesizing long videos of arbitrary scenes solely from an input text describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To achieve 3D consistency, i.e., to generate videos that depict geometrically plausible scenes, we deploy online test-time training that encourages the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is updated throughout the generation process and used for rendering. In contrast to previous works, which are applicable only to limited domains (e.g., landscapes), our framework generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles. Project page: https://scenescape.github.io/
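The abstract states that per-frame depth maps are turned into a mesh representation but does not say how. As a rough illustration only, and not the authors' implementation, the sketch below lifts a depth map to 3D points and connects neighbouring pixels into triangles. It assumes a pinhole camera with the principal point at the image centre; the names `unproject_depth`, `grid_triangles`, and `depth_to_mesh` and the `focal` parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def unproject_depth(depth: List[List[float]], focal: float) -> List[Point]:
    """Lift each pixel (u, v) with depth d to a 3D point in camera coordinates.

    Assumption: pinhole camera, principal point at the image centre.
    """
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    points: List[Point] = []
    for v in range(h):
        for u in range(w):
            d = depth[v][u]
            points.append(((u - cx) * d / focal, (v - cy) * d / focal, d))
    return points


def grid_triangles(h: int, w: int) -> List[Triangle]:
    """Connect neighbouring pixels: two triangles per 2x2 pixel cell.

    Vertex indices follow row-major pixel order (index = v * w + u).
    """
    triangles: List[Triangle] = []
    for v in range(h - 1):
        for u in range(w - 1):
            i = v * w + u
            triangles.append((i, i + w, i + 1))
            triangles.append((i + 1, i + w, i + w + 1))
    return triangles


def depth_to_mesh(depth: List[List[float]], focal: float):
    """Return (vertices, triangles) for a single depth map."""
    h, w = len(depth), len(depth[0])
    return unproject_depth(depth, focal), grid_triangles(h, w)
```

For example, `depth_to_mesh([[2.0, 2.0], [2.0, 2.0]], focal=1.0)` yields four vertices on the plane z = 2 and two triangles. A full pipeline would additionally need to merge the meshes of successive frames and handle depth discontinuities, which this sketch does not attempt.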

updated: Thu Feb 02 2023 14:47:19 GMT+0000 (UTC)

published: Thu Feb 02 2023 14:47:19 GMT+0000 (UTC)

arXiv
