Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Liao Shen; Xingyi Li; Huiqiang Sun; Juewen Peng; Ke Xian; Zhiguo Cao; Guosheng Lin

We study the problem of synthesizing a long-term dynamic video from a single image. This is challenging because the visual content must move consistently under large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. Addressing these issues requires estimating the underlying 4D representation (3D geometry plus scene motion) and filling in the occluded regions. To this end, we present Make-It-4D, a novel method that generates a consistent long-term dynamic video from a single image. On the one hand, we use layered depth images (LDIs) to represent the scene and unproject them to form a feature point cloud. To animate the visual content, the feature point cloud is displaced according to the scene flow derived from motion estimation and the corresponding camera pose. This 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image, which enables our method to work under large camera motions. Thanks to this design, our method is training-free, which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach and show compelling rendering results.
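
To make the unproject-displace-reproject idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a single depth layer instead of a full LDI, a pinhole camera with known intrinsics `K`, raw 3D points instead of learned features, and a constant per-pixel 2D flow applied at fixed depth. The function names `unproject`, `displace`, and `project` are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift every pixel of a depth map to a 3D point in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, row-major
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # rays with z = 1
    return rays * depth.reshape(-1, 1)              # scale each ray by its depth

def displace(points, flow_2d, K, t):
    """Move points by t steps of a constant 2D flow (assumption: depth is kept fixed)."""
    z = points[:, 2:3]
    pix = (points / z) @ K.T                        # homogeneous pixel coordinates
    pix[:, :2] += t * flow_2d.reshape(-1, 2)        # shift in image space
    return (pix @ np.linalg.inv(K).T) * z           # lift back to 3D

def project(points, R, tvec, K):
    """Project 3D points into a camera with rotation R and translation tvec."""
    cam = points @ R.T + tvec
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

# Hypothetical toy inputs: a flat 4x4 scene at depth 2 with rightward flow.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                                  # one pixel per step along u

cloud = unproject(depth, K)
moved = displace(cloud, flow, K, t=3)
pixels = project(moved, np.eye(3), np.zeros(3), K)
```

Because the displaced points live in 3D, the same cloud can be rendered from any camera pose by changing `R` and `tvec`, which is the property the abstract relies on for consistency across views.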

updated: Sun Aug 20 2023 12:53:50 GMT+0000 (UTC)

published: Sun Aug 20 2023 12:53:50 GMT+0000 (UTC)

arXiv
