StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

Ivan Skorokhodov; Sergey Tulyakov; Mohamed Elhoseiny

Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be, time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image + video discriminator pair and design a holistic discriminator that aggregates temporal information by simply concatenating frames' features. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024^2 videos for the first time. We build our model on top of StyleGAN2, and it is just ≈5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at an arbitrarily high frame rate, while prior work struggles to generate even 64 frames at a fixed rate. Our model is tested on four modern 256^2 and one 1024^2 resolution video synthesis benchmarks. In terms of sheer metrics, it performs on average ≈30% better than the closest runner-up. Project website: https://universome.github.io.
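To make the idea of a time-continuous signal concrete, here is a minimal sketch of how a real-valued timestamp can be turned into a positional embedding, and how sparse training frames might be picked from a clip. It is not the authors' implementation: it uses a generic fixed sinusoidal embedding, and the names `time_embedding`, `sample_timestamps`, `sample_sparse_frames`, `num_freqs` and `base_period` are hypothetical, introduced only for illustration.

```python
import math
import random


def time_embedding(t, num_freqs=4, base_period=16.0):
    """Generic sinusoidal embedding of a real-valued timestamp t (seconds).

    Assumption: fixed frequencies that double at every step. The paper's
    motion representation is more elaborate; this only illustrates that
    the embedding is defined for any real t, not just integer frame indices.
    """
    features = []
    for k in range(num_freqs):
        omega = 2.0 * math.pi * (2 ** k) / base_period
        features.append(math.sin(omega * t))
        features.append(math.cos(omega * t))
    return features


def sample_timestamps(duration, fps):
    """Regular grid of timestamps covering `duration` seconds at `fps`."""
    num_frames = int(round(duration * fps))
    return [i / fps for i in range(num_frames)]


def sample_sparse_frames(clip_length, frames_per_clip=2, seed=None):
    """Pick a few distinct frame indices from a clip, in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(clip_length), frames_per_clip))


# The same one-second interval queried at two different frame rates.
coarse = [time_embedding(t) for t in sample_timestamps(1.0, 4)]
fine = [time_embedding(t) for t in sample_timestamps(1.0, 8)]

# Two training frames drawn from a hypothetical 16-frame clip.
picked = sample_sparse_frames(16, frames_per_clip=2, seed=0)
```

Because the embedding is a function of continuous time, changing the frame rate only changes where it is evaluated: the timestamp 0.25 s receives the same embedding whether it is frame 1 of the 4 fps grid or frame 2 of the 8 fps grid.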

updated: Tue May 31 2022 20:39:09 GMT+0000 (UTC)

published: Wed Dec 29 2021 17:58:29 GMT+0000 (UTC)

arXiv
