APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Yupu Yao; Shangqi Deng; Zihan Cao; Harry Zhang; Liang-Jian Deng

Diffusion models have shown promising progress in video generation, but they often struggle to keep details within local regions consistent across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution with predicted noise, without fully accounting for the information inherent in the input itself. These models also emphasize the difference between predictions and references while neglecting information intrinsic to the videos. To address this limitation, and inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds on pre-trained Stable Diffusion networks. Notably, we introduce an additional compact network, the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the information inherent in the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos, both qualitatively and quantitatively.
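
To make the noise-prediction idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows the standard diffusion forward step and a mean-squared-error loss on the predicted noise, with an optional additive perturbation term. The function names `forward_diffuse` and `noise_loss`, and the assumption that the perturbation is simply added to the predicted noise, are hypothetical illustrations; the abstract does not specify how the VGT output enters the objective, and the adversarial training component is omitted entirely.

```python
import math
import random


def forward_diffuse(x0, alpha_bar, rng):
    """Standard forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    x0 is a flat list of floats standing in for a latent; alpha_bar is the
    cumulative noise-schedule product at step t. Returns (x_t, eps).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    xt = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_loss(eps, eps_pred, perturbation=None):
    """Mean squared error between true noise and predicted noise.

    ASSUMPTION (hypothetical): the perturbation extracted from the input is
    added to the predicted noise before comparison. With no perturbation
    this reduces to the usual noise-prediction loss.
    """
    if perturbation is None:
        perturbation = [0.0] * len(eps)
    total = sum((e - (p + d)) ** 2 for e, p, d in zip(eps, eps_pred, perturbation))
    return total / len(eps)
```

Under this assumed form, a perturbation that matches the residual between the true and predicted noise would drive the loss to zero, which is one way to read the abstract's claim that the extra network compensates for what plain noise prediction misses.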

updated: Thu May 02 2024 01:07:49 GMT+0000 (UTC)

published: Thu Aug 24 2023 07:11:00 GMT+0000 (UTC)

arXiv
