Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Songwei Ge; Thomas Hayes; Harry Yang; Xi Yin; Guan Pang; David Jacobs; Jia-Bin Huang; Devi Parikh

Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio. Videos and code can be found at https://songweige.github.io/projects/tats/index.html.

Updated: 24 July 2022, 00:25:54 UTC

Published: 7 April 2022, 17:59:02 UTC

arXiv
