CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang; Jiayan Teng; Wendi Zheng; Ming Ding; Shiyu Huang; Jiazheng Xu; Yuanming Yang; Wenyi Hong; Xiaohan Zhang; Guanyu Feng; Da Yin; Xiaotao Gu; Yuxuan Zhang; Weihan Wang; Yean Cheng; Ting Liu; Bin Xu; Yuxiao Dong; Jie Tang

We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. This pipeline significantly enhances the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics as well as human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
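
To make the idea of compressing along both spatial and temporal dimensions concrete, the following is a minimal sketch of the shape arithmetic for a causal 3D VAE. The abstract does not state the compression factors, so the temporal factor of 4, the spatial factor of 8, the clip size, and the function name `latent_shape` are all illustrative assumptions. The sketch also assumes the causal convention in which the first frame is encoded on its own and each following group of frames collapses to one latent frame.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Return (latent_frames, latent_height, latent_width).

    Hypothetical helper: t_factor and s_factor are assumed values,
    not figures taken from the abstract.
    """
    # Causal convention (assumed): frame 0 is kept as its own latent frame,
    # and the remaining frames are grouped t_factor at a time.
    if frames < 1 or (frames - 1) % t_factor != 0:
        raise ValueError("frames must equal 1 + k * t_factor")
    if height % s_factor != 0 or width % s_factor != 0:
        raise ValueError("height and width must be divisible by s_factor")
    latent_frames = 1 + (frames - 1) // t_factor
    return latent_frames, height // s_factor, width // s_factor


# Example clip size chosen for illustration only.
frames, height, width = 49, 480, 720
lt, lh, lw = latent_shape(frames, height, width)

# Ratio of pixel positions to latent positions (channels ignored).
reduction = (frames * height * width) / (lt * lh * lw)
print((lt, lh, lw), round(reduction, 1))
```

Under these assumed factors, the number of positions the transformer has to attend over shrinks by roughly two orders of magnitude, which is the efficiency argument the abstract makes for the 3D VAE.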

updated: Mon Aug 12 2024 11:47:11 GMT+0000 (UTC)

published: Mon Aug 12 2024 11:47:11 GMT+0000 (UTC)

arXiv

