VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk; Lijun Yu; Xiuye Gu; José Lezama; Jonathan Huang; Grant Schindler; Rachel Hornung; Vighnesh Birodkar; Jimmy Yan; Ming-Chang Chiu; Krishna Somandepalli; Hassan Akbari; Yair Alon; Yong Cheng; Josh Dillon; Agrim Gupta; Meera Hahn; Anja Hauth; David Hendon; Alonso Martinez; David Minnen; Mikhail Sirotenko; Kihyuk Sohn; Xuan Yang; Hartwig Adam; Ming-Hsuan Yang; Irfan Essa; Huisheng Wang; David A. Ross; Bryan Seybold; Lu Jiang

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

updated: Tue Jun 04 2024 17:25:20 GMT+0000 (UTC)

published: Thu Dec 21 2023 18:46:41 GMT+0000 (UTC)

arXiv

