Audio2Gestures: Generating Diverse Gestures from Audio

Jing Li; Di Kang; Wenjie Pei; Xuefei Zhe; Ying Zhang; Linchao Bao; Zhenyu He

People speaking the same sentence may perform different gestures, depending on various mental and physical factors. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, which easily results in plain, boring motions during inference. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. In addition, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise ones (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.
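The split latent code described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The dimensions, the fixed random linear maps standing in for trained encoder/decoder networks, and every function name are hypothetical. It also assumes the motion-specific code is drawn from a standard normal prior at inference time, which the abstract does not state. The point is only that one audio input yields one shared code, while resampling the motion-specific code yields several different candidate motions.

```python
import random

# All sizes below are hypothetical and chosen only for illustration.
SHARED_DIM = 4      # audio-correlated (shared) code
SPECIFIC_DIM = 4    # motion-specific code
AUDIO_DIM = 8       # one frame of audio features
MOTION_DIM = 6      # one pose vector


def make_linear(in_dim, out_dim, rng):
    """Fixed random weight matrix standing in for a trained network."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def encode_audio(audio_features, audio_weights):
    """Toy audio encoder: audio only determines the shared code."""
    return apply_linear(audio_weights, audio_features)


def decode_motion(shared_code, specific_code, decoder_weights):
    """Toy decoder: consumes the concatenation [shared | motion-specific]."""
    return apply_linear(decoder_weights, shared_code + specific_code)


def sample_gestures(audio_features, audio_weights, decoder_weights, n, rng):
    """Produce n candidate poses for the same audio.

    Assumption: the motion-specific code is sampled from N(0, I).
    """
    shared = encode_audio(audio_features, audio_weights)
    motions = []
    for _ in range(n):
        specific = [rng.gauss(0.0, 1.0) for _ in range(SPECIFIC_DIM)]
        motions.append(decode_motion(shared, specific, decoder_weights))
    return motions


rng = random.Random(0)
audio_weights = make_linear(AUDIO_DIM, SHARED_DIM, rng)
decoder_weights = make_linear(SHARED_DIM + SPECIFIC_DIM, MOTION_DIM, rng)
audio = [0.1 * i for i in range(AUDIO_DIM)]

# Same audio, three different motion-specific samples, three different poses.
motions = sample_gestures(audio, audio_weights, decoder_weights, 3, rng)
```

A one-to-one model corresponds to dropping the motion-specific input entirely, so the same audio always decodes to the same pose. In this sketch, all variation among the outputs comes from the sampled code.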

updated: Tue Jan 17 2023 04:09:58 GMT+0000 (UTC)

published: Tue Jan 17 2023 04:09:58 GMT+0000 (UTC)

arXiv
