Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Tsu-Jui Fu; Licheng Yu; Ning Zhang; Cheng-Yang Fu; Jong-Chyi Su; William Yang Wang; Sean Bell

Generating a video from its first few static frames is challenging because the model must anticipate plausible future frames with temporal coherence. Besides video prediction, the abilities to rewind from the last frame and to infill between the head and tail frames are also crucial, but they have rarely been explored for video completion. Since just a few frames can hint at many different outcomes, a system that follows natural-language instructions to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
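As a rough illustration of how one model could cover the three TVC cases through masking conditions alone, the sketch below builds a per-frame mask for each case. It is not the authors' implementation: the function name `tvc_frame_mask`, the `num_given` parameter, and the choice to mask whole frames (rather than individual visual tokens) are assumptions made here for clarity, since the abstract does not specify these details.

```python
from typing import List


def tvc_frame_mask(num_frames: int, task: str, num_given: int = 1) -> List[bool]:
    """Hypothetical helper: True marks a frame to generate, False a given frame.

    prediction: the first `num_given` frames are given, the rest are masked.
    rewind:     the last `num_given` frames are given, the rest are masked.
    infilling:  the first and last `num_given` frames are given.
    """
    if num_given < 1 or num_given > num_frames:
        raise ValueError("num_given must be between 1 and num_frames")

    head = set(range(num_given))
    tail = set(range(num_frames - num_given, num_frames))

    if task == "prediction":
        given = head
    elif task == "rewind":
        given = tail
    elif task == "infilling":
        given = head | tail
    else:
        raise ValueError(f"unknown task: {task!r}")

    # Every frame that is not given must be produced by the model.
    return [i not in given for i in range(num_frames)]


if __name__ == "__main__":
    for name in ("prediction", "rewind", "infilling"):
        print(name, tvc_frame_mask(5, name))
```

In this reading, the text instruction and the unmasked frames form the conditioning context, and the masked positions are what the model fills in; only the mask changes between the three cases.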

updated: Wed Nov 23 2022 10:14:12 GMT+0000 (UTC)

published: Wed Nov 23 2022 10:14:12 GMT+0000 (UTC)

arXiv
