Sound-Guided Semantic Video Generation

Seung Hyun Lee; Gyeongrok Oh; Wonmin Byeon; Chanyoung Kim; Won Jeong Ryoo; Sang Ho Yoon; Hyunjun Cho; Jihyun Bae; Jinkyu Kim; Sangpil Kim

The recent success of StyleGAN demonstrates that the latent space of a pre-trained StyleGAN is useful for realistic video generation. However, the motion generated in such videos is usually not semantically meaningful, because it is difficult to determine the direction and magnitude of movement in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. Because sound provides the temporal context of a scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound, and generates a video in a hierarchical manner. We provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. Experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further show several applications, including image and video editing, to verify the effectiveness of our method.
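The idea of mapping audio into a latent space and then moving along a sound-conditioned trajectory can be illustrated with a toy sketch. This is not the authors' implementation: the paper's sound inversion module and frame generator are learned networks, whereas the functions below (`sound_inversion`, `latent_trajectory`), the random linear maps, and all dimensions are hypothetical stand-ins chosen only to show the data flow.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sound_inversion(audio_feature, inversion_weights):
    """Hypothetical stand-in: map one audio feature to a starting latent code."""
    return matvec(inversion_weights, audio_feature)


def latent_trajectory(start_code, audio_frames, step_weights, step_size=0.5):
    """Walk through latent space, taking one audio-conditioned step per frame."""
    codes = [start_code]
    for audio_feature in audio_frames:
        # Direction of the step depends on the audio for this frame (assumption).
        direction = matvec(step_weights, audio_feature)
        previous = codes[-1]
        codes.append([w + step_size * d for w, d in zip(previous, direction)])
    return codes


rng = random.Random(0)
AUDIO_DIM, LATENT_DIM, NUM_FRAMES = 8, 16, 5  # illustrative sizes only

inversion_weights = random_matrix(LATENT_DIM, AUDIO_DIM, rng)
step_weights = random_matrix(LATENT_DIM, AUDIO_DIM, rng)
audio_frames = [
    [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)] for _ in range(NUM_FRAMES)
]

# The first audio frame fixes the starting code; later frames move it.
start_code = sound_inversion(audio_frames[0], inversion_weights)
trajectory = latent_trajectory(start_code, audio_frames[1:], step_weights)
print(len(trajectory), len(trajectory[0]))
```

In a real system each latent code in `trajectory` would be decoded by a pre-trained generator into one video frame; that decoding step is omitted here because it depends on a model this sketch does not include.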

updated: Tue Aug 30 2022 08:00:48 GMT+0000 (UTC)

published: Wed Apr 20 2022 07:33:10 GMT+0000 (UTC)

arXiv
