Video Diffusion Models with Local-Global Context Guidance

Siyuan Yang; Lu Zhang; Yu Liu; Zhizhuo Jiang; You He

Diffusion models have emerged as a powerful paradigm for video synthesis tasks, including prediction, generation, and interpolation. Because of limited computational budgets, existing methods usually implement conditional diffusion models with an autoregressive inference pipeline, in which each future fragment is predicted from the distribution of adjacent past frames. However, conditioning on only a few previous frames cannot capture global temporal coherence, which leads to inconsistent or even implausible results in long-term video prediction. In this paper, we propose a Local-Global Context guided Video Diffusion model (LGC-VD) that captures multi-perception conditions to produce high-quality videos in both conditional and unconditional settings. In LGC-VD, the UNet is implemented with stacked residual blocks and self-attention units, avoiding the high computational cost of 3D convolutions. We construct a local-global context guidance strategy that captures the multi-perceptual embedding of the past fragment to improve the consistency of future predictions. Furthermore, we propose a two-stage training strategy that alleviates the effect of noisy frames and yields more stable predictions. Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation. We release code at https://github.com/exisas/LGC-VD.
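
The autoregressive pipeline described above can be illustrated with a toy sketch. This is not the LGC-VD implementation: frames are plain floats, `predict_fragment` is a hypothetical stand-in for a conditional diffusion sampler, and using the mean of all past frames as the "global context" is an assumption made only so that the loop runs. It shows only the control flow, in which each new fragment is conditioned on both the adjacent past frames (local) and a summary of everything generated so far (global).

```python
import random


def predict_fragment(local_ctx, global_ctx, length, rng):
    """Hypothetical stand-in for a conditional diffusion sampler.

    A real model would run iterative denoising conditioned on both
    contexts; here the two are blended with a little noise so that
    the surrounding loop is runnable.
    """
    last = local_ctx[-1]
    return [0.5 * last + 0.5 * global_ctx + rng.gauss(0.0, 0.01)
            for _ in range(length)]


def generate_video(initial_frames, total_frames, fragment_len=4, seed=0):
    """Autoregressively extend a list of toy scalar 'frames'."""
    rng = random.Random(seed)
    frames = list(initial_frames)
    while len(frames) < total_frames:
        # Local context: the most recent frames (assumed window size).
        local_ctx = frames[-fragment_len:]
        # Global context: a toy summary of all frames so far (assumption).
        global_ctx = sum(frames) / len(frames)
        n = min(fragment_len, total_frames - len(frames))
        frames.extend(predict_fragment(local_ctx, global_ctx, n, rng))
    return frames


video = generate_video([0.0, 1.0], total_frames=10)
print(len(video))
```

Dropping `global_ctx` from `predict_fragment` would give the local-only baseline that the abstract argues loses long-range coherence.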

updated: Mon Jun 05 2023 03:32:27 GMT+0000 (UTC)

published: Mon Jun 05 2023 03:32:27 GMT+0000 (UTC)

arXiv
