Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements

Kun Su; Xiulong Liu; Eli Shlizerman

We propose a novel system that takes as input the body movements of a musician playing a musical instrument and generates music in an unsupervised setting. Learning to generate multi-instrumental music from videos without labeling the instruments is a challenging problem. To achieve this transformation, we built a pipeline named 'Multi-instrumentalistNet' (MI Net). At its base, the pipeline uses a Vector Quantized Variational Autoencoder (VQ-VAE) with multi-band residual blocks to learn, from log-spectrograms, a discrete latent representation of music played on various instruments. The pipeline is then trained along with an autoregressive prior conditioned on the movements of the musician's body keypoints, encoded by a recurrent neural network. Joint training of the prior with the body movement encoder succeeds in disentangling the music into latent features that indicate the musical components and the instrumental features. The resulting latent space yields distributions clustered by instrument, from which new music can be generated. Furthermore, the VQ-VAE architecture supports detailed music generation with additional conditioning. We show that MIDI can further condition the latent space so that the pipeline generates the exact content of the music being played by the instrument in the video. We evaluate MI Net on two datasets containing videos of 13 instruments and obtain generated music of reasonable audio quality that is easily associated with the corresponding instrument and consistent with the music audio content.
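The discrete latent representation mentioned in the abstract rests on the vector quantization step common to VQ-VAE models: each continuous encoder output is replaced by its nearest entry in a learned codebook. The following is a minimal sketch of that generic step only, not the authors' implementation. It does not cover the multi-band residual blocks, the autoregressive prior, or the body movement encoder, and the function names, codebook values, and latent values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each continuous latent vector to its nearest codebook entry.

    Returns the discrete code indices and the corresponding quantized vectors.
    """
    indices: List[int] = []
    for z in latents:
        # Choose the codebook index whose entry is closest to z.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(nearest)
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


# Hypothetical toy values: 3 codebook entries and 4 encoder outputs, dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.7, 0.6]]

indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2, 1]
```

In a VQ-VAE, the sequence of indices is what an autoregressive prior models; in the system described here, that prior is additionally conditioned on the encoded body movements.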

updated: Mon Dec 07 2020 06:54:10 GMT+0000 (UTC)

published: Mon Dec 07 2020 06:54:10 GMT+0000 (UTC)

arXiv
