BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

Haiyang Liu; Zihao Zhu; Naoya Iwamoto; Yichen Peng; Zhengqing Li; You Zhou; Elif Bozkurt; Bo Zheng

Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis of BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which consists of the above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate semantic relevance, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the metric's validity, the ground truth data quality, and the baseline's state-of-the-art performance. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, and it may contribute to a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code, and model are available at https://pantomatrix.github.io/BEAT/.

updated: Tue Sep 20 2022 05:44:29 GMT+0000 (UTC)

published: Thu Mar 10 2022 11:19:52 GMT+0000 (UTC)

arXiv
