Video Probabilistic Diffusion Models in Projected Latent Space

Sihyun Yu; Kihyuk Sohn; Subin Kim; Jinwoo Shin

Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos remains a challenge because of their high dimensionality, complex temporal dynamics, and large spatial variations. Recent work on diffusion models has shown potential to address this challenge, but these models suffer from severe computation and memory inefficiency that limits their scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion model (PVDM): a probabilistic diffusion model that learns a video distribution in a low-dimensional latent space and can therefore be trained efficiently on high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video into 2D-shaped latent vectors that factorize the complex cubic structure of video pixels, and (b) a diffusion model architecture specialized for this new factorized latent space, together with a training/sampling procedure for synthesizing videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM over previous video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, improving on the prior state-of-the-art score of 1773.4 (lower FVD is better).
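To give a rough sense of why a factorized 2D latent can be cheaper than a cubic one, the following sketch compares element counts per channel. It assumes, purely for illustration, that the video latent is replaced by three 2D planes, one per axis (height x width, frames x width, frames x height); the abstract states only that the latents are "2D-shaped", so the three-plane layout, the function name `latent_sizes`, and the example sizes are all hypothetical.

```python
def latent_sizes(frames, height, width):
    """Return (cubic, factorized) latent element counts per channel.

    Assumption for illustration only: the factorized latent consists of
    three 2D planes, one per axis of the frames x height x width volume.
    """
    cubic = frames * height * width
    factorized = height * width + frames * width + frames * height
    return cubic, factorized


# Hypothetical latent grid: 16 frames at 32 x 32 spatial resolution.
cubic, factorized = latent_sizes(frames=16, height=32, width=32)
print(cubic, factorized, cubic / factorized)
```

Under these assumed sizes the cubic latent grows with the product of the three dimensions, while the factorized one grows only with sums of pairwise products, which is the kind of saving the abstract attributes to the low-dimensional latent space.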

updated: Thu Mar 30 2023 07:08:21 GMT+0000 (UTC)

published: Wed Feb 15 2023 14:22:34 GMT+0000 (UTC)

arXiv
