Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Jing Li; Di Kang; Wenjie Pei; Xuefei Zhe; Ying Zhang; Zhenyu He; Linchao Bao

Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, resulting in plain, boring motions during inference. To overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. To train the VAE better, we design a mapping network that facilitates random sampling, along with other techniques including a relaxed motion loss, a bicycle constraint, and a diversity loss. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, both quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
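The latent split described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the function names (`encode_audio`, `sample_motion_specific`, `decode`, `generate`), the two-dimensional codes, and the fixed linear decoder are all hypothetical stand-ins, and the training techniques mentioned above (mapping network, relaxed motion loss, bicycle constraint, diversity loss) are left out. It only shows the idea that the shared code is computed from the audio, the motion-specific code is sampled independently of it, and the same audio can therefore yield different motions.

```python
import random

def encode_audio(audio_frames):
    """Toy audio encoder (assumption): one 2-D shared code per audio frame."""
    return [[0.5 * a, -0.5 * a] for a in audio_frames]

def sample_motion_specific(num_frames, dim, rng):
    """Sample a motion-specific code from N(0, 1), independent of the audio."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]

def decode(shared, specific):
    """Toy decoder (assumption): concatenate both codes and apply a fixed
    linear map, producing one scalar 'joint value' per frame."""
    weights = [1.0, 0.5, 0.3, -0.2]
    motion = []
    for s, m in zip(shared, specific):
        z = s + m  # list concatenation: [shared code | motion-specific code]
        motion.append(sum(w * v for w, v in zip(weights, z)))
    return motion

def generate(audio_frames, seed):
    """Generate one motion sequence for the given audio and random seed."""
    rng = random.Random(seed)
    shared = encode_audio(audio_frames)
    specific = sample_motion_specific(len(audio_frames), 2, rng)
    return decode(shared, specific)

# Same audio, different motion-specific samples -> different motions.
audio = [0.1, 0.9, 0.2, 0.8]
motion_a = generate(audio, seed=0)
motion_b = generate(audio, seed=1)
```

In this sketch, re-running `generate` with the same seed reproduces the same sequence, while changing only the seed changes the motion without touching the audio-derived part of the code.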

updated: Sun Aug 15 2021 11:15:51 GMT+0000 (UTC)

published: Sun Aug 15 2021 11:15:51 GMT+0000 (UTC)

arXiv
