Taming Visually Guided Sound Generation

Vladimir Iashin; Esa Rahtu

Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model that generates visually relevant, high-fidelity sounds prompted by a set of frames from open-domain videos, and that does so on a single GPU in less time than the sound takes to play. We train a transformer to sample a new spectrogram from a pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN, which significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and are designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state of the art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
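To illustrate what a "codebook" means in this setting, the following is a minimal sketch of the nearest-neighbour lookup commonly used in vector-quantized models: each continuous feature vector is replaced by the index of its closest codebook entry, and a transformer can then be trained over those discrete indices. It is not taken from the authors' code. The function name `quantize`, the squared-Euclidean distance, and all toy values are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to its nearest codebook entry.

    Hypothetical helper: distance is squared Euclidean, which is an
    assumption here, not a detail confirmed by the abstract.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for vec in features:
        # Squared Euclidean distance from this vector to every code.
        dists = [
            sum((v - c) ** 2 for v, c in zip(vec, code))
            for code in codebook
        ]
        # Index of the closest code; ties resolve to the lowest index.
        best = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy, made-up values: 3 codes of dimension 2 and 4 encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.7, 0.6]]

indices, quantized = quantize(features, codebook)
print(indices)  # discrete tokens a sequence model could be trained on
```

In a setup like the one the abstract describes, the sequence of indices would stand in for a spectrogram, so sampling new indices and decoding them yields a new spectrogram. The actual encoder, decoder, and training losses are outside the scope of this sketch.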

updated: Sun Oct 17 2021 11:14:00 GMT+0000 (UTC)

published: Sun Oct 17 2021 11:14:00 GMT+0000 (UTC)

arXiv
