Sequence-to-Sequence Predictive Model: From Prosody To Communicative Gestures

Fajrian Yunus; Chloé Clavel; Catherine Pelachaud

Communicative gestures and speech acoustics are tightly linked. Our objective is to predict the timing of gestures from the acoustics; that is, we want to predict when a certain gesture occurs. We develop a model based on a recurrent neural network with an attention mechanism. The model is trained on a corpus of natural dyadic interaction in which the speech acoustics and the gesture phases and types have been annotated. The input of the model is a sequence of speech acoustic features and the output is a sequence of gesture classes. The classes used for the model output are based on a combination of gesture phases and gesture types. We use a sequence comparison technique to evaluate the model's performance. We find that the model predicts certain gesture classes better than others. We also perform ablation studies, which reveal that fundamental frequency is a relevant feature for the gesture prediction task. In another sub-experiment, we find that including eyebrow movements, treated as beat gestures, improves the performance. We further find that a model trained on the data of one speaker also works for the other speaker of the same conversation. Finally, we perform a subjective experiment to measure how respondents judge the naturalness, the time consistency, and the semantic consistency of the generated gesture timing of a virtual agent. Our respondents rate the output of our model favorably.
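
The abstract states that a sequence comparison technique is used for evaluation but does not name it. As a minimal sketch of what comparing a predicted gesture-class sequence against an annotated one can look like, the example below assumes a plain edit (Levenshtein) distance; this is an illustrative choice, not necessarily the technique used in the paper. The function names and the gesture-class labels (`none`, `beat_stroke`) are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences.

    Assumption: edit distance stands in for the (unspecified)
    sequence comparison technique mentioned in the abstract.
    """
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j labels of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_similarity(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Map the edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(ref), len(hyp))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ref, hyp) / longest


# Hypothetical per-frame gesture classes (labels are illustrative only).
annotated = ["none", "beat_stroke", "beat_stroke", "none"]
predicted = ["none", "none", "beat_stroke", "none"]
print(edit_distance(annotated, predicted))
print(normalized_similarity(annotated, predicted))
```

A distance of this kind tolerates small timing shifts between the predicted and annotated sequences, which a strict frame-by-frame accuracy would penalize in full.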

updated: Fri Apr 23 2021 21:03:40 GMT+0000 (UTC)

published: Mon Aug 17 2020 21:55:22 GMT+0000 (UTC)

arXiv
