Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

Jiawei Liu; Weining Wang; Sihan Chen; Xinxin Zhu; Jing Liu

As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at token level for auto-regressive sounding video generation. AudioSetCap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on Kinetics and VAS datasets.
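
The abstract does not spell out the form of the hybrid contrastive objective. As a rough illustration of what an inter-modal contrastive term can look like, the following is a minimal sketch of a generic symmetric InfoNCE loss over paired video and audio features. It is not the authors' loss: the function name `info_nce`, the `temperature` value, and the assumption that row i of each array comes from the same clip are all hypothetical choices made for this example.

```python
import numpy as np


def info_nce(video_feats, audio_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's exact loss).

    Assumption: row i of `video_feats` and row i of `audio_feats` come from
    the same clip (positive pair); every other row in the batch is a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diagonal(scores):
        # Numerically stable row-wise log-softmax; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce(feats, feats))
    print("mismatched pairs:", info_nce(feats, feats[::-1]))
```

Under these assumptions, correctly paired features should give a lower loss than mismatched ones, which is the property a contrastive term is meant to reward. An intra-modal term could plausibly reuse the same function on two views of one modality, though the abstract does not confirm that design.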

updated: Wed Mar 29 2023 09:07:31 GMT+0000 (UTC)

published: Wed Mar 29 2023 09:07:31 GMT+0000 (UTC)

arXiv
