Responsive Listening Head Generation: A Benchmark Dataset and Baseline

Mohan Zhou; Yalong Bai; Wei Zhang; Ting Yao; Tiejun Zhao; Tao Mei

We present a new listening head generation benchmark for synthesizing the responsive feedback of a listener (e.g., nod, smile) during a face-to-face conversation. As the indispensable complement to talking head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, "ViCo", highlighting listening head generation during a face-to-face conversation. A total of 92 identities (67 speakers and 76 listeners) are involved in ViCo, featuring 483 clips in a paired "speaking-listening" pattern, where listeners show three listening styles based on their attitudes: positive, neutral, and negative. Different from traditional speech-to-gesture or talking head generation, listening head generation takes as input both the audio and visual signals from the speaker and gives non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, and cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Project page: https://project.mhzhou.com/rld.
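The task interface described in the abstract can be illustrated with a minimal Python sketch. It is not the released baseline: every name here (`Attitude`, `SpeakerFrame`, `ListenerFrame`, `generate_listener`) is hypothetical, the feature vectors are placeholder lists of floats, and the running-average "model" is a stand-in chosen only to show the shape of the problem. The sketch shows three things taken from the abstract: the input combines the speaker's audio and visual signals, the output is conditioned on one of three attitudes, and each listener frame depends only on speaker frames seen so far, as real-time use requires.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, List, Sequence


class Attitude(Enum):
    """The three listening styles named in the abstract."""
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


@dataclass
class SpeakerFrame:
    audio: List[float]   # hypothetical per-frame audio features
    visual: List[float]  # hypothetical per-frame visual features


@dataclass
class ListenerFrame:
    head_motion: List[float]  # hypothetical head-motion parameters
    expression: List[float]   # hypothetical facial-expression parameters


def _mean(values: List[float]) -> float:
    # Mean of a feature vector; an empty vector contributes 0.0.
    return sum(values) / len(values) if values else 0.0


def generate_listener(
    speaker_frames: Sequence[SpeakerFrame], attitude: Attitude
) -> Iterator[ListenerFrame]:
    """Placeholder causal generator: output t uses speaker frames 0..t only."""
    # Assumed toy conditioning: the attitude shifts the expression value.
    bias = {
        Attitude.POSITIVE: 1.0,
        Attitude.NEUTRAL: 0.0,
        Attitude.NEGATIVE: -1.0,
    }[attitude]
    audio_sum = 0.0
    visual_sum = 0.0
    for t, frame in enumerate(speaker_frames, start=1):
        audio_sum += _mean(frame.audio)
        visual_sum += _mean(frame.visual)
        # Running averages stand in for a learned model.
        yield ListenerFrame(
            head_motion=[visual_sum / t],
            expression=[audio_sum / t + bias],
        )
```

Because the generator yields one listener frame per speaker frame without looking ahead, it can be driven frame by frame from a live stream; a real model would replace the running averages with a learned predictor.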

updated: Tue Mar 15 2022 05:48:18 GMT+0000 (UTC)

published: Mon Dec 27 2021 07:18:50 GMT+0000 (UTC)

arXiv
