V2Meow: Meowing to the Visual Beat via Music Generation

Kun Su; Judith Yue Li; Qingqing Huang; Dima Kuzmin; Joonseok Lee; Chris Donahue; Fei Sha; Aren Jansen; Yu Wang; Mauro Verzetti; Timo I. Denk

Generating high-quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems generate symbolic music data, such as MIDI files, rather than raw audio waveforms. Given the limited availability of symbolic music data, such methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that can generate high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model trained on O(100K) music audio clips paired with video frames, mined from in-the-wild music videos; no parallel symbolic music data is involved. V2Meow can synthesize high-fidelity music audio waveforms conditioned solely on pre-trained visual features extracted from an arbitrary silent video clip, and it also allows high-level control over the music style of generated examples via text prompts supplied in addition to the video-frame conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.

updated: Thu May 11 2023 06:26:41 GMT+0000 (UTC)

published: Thu May 11 2023 06:26:41 GMT+0000 (UTC)

arXiv
