xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Can Qin; Congying Xia; Krithika Ramakrishnan; Michael Ryoo; Lifu Tu; Yihao Feng; Manli Shu; Honglu Zhou; Anas Awadalla; Jun Wang; Senthil Purushwalkam; Le Xue; Yingbo Zhou; Huan Wang; Silvio Savarese; Juan Carlos Niebles; Zeyuan Chen; Ran Xu; Caiming Xiong

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.

updated: Sat Aug 31 2024 05:12:09 GMT+0000 (UTC)

published: Thu Aug 22 2024 17:55:22 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト