Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Lincheng Li; Suzhe Wang; Zhimeng Zhang; Yu Ding; Yixing Zheng; Xin Yu; Changjie Fan

In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. Specifically, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks that separately generate animation parameters for the mouth, upper face, and head from text. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored to different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the input individuals. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audio, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. After attaining the visual and audio correspondences, we can effectively train our network in an end-to-end fashion. Extensive qualitative and quantitative experiments demonstrate that our algorithm produces high-quality photo-realistic talking-head videos with varied facial expressions and head motions that follow speech rhythm, and that it outperforms the state of the art.
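
The abstract says the speaker-specific stage uses an attention mask to manipulate facial expression changes, but it does not state how the mask is applied. The following is only a minimal sketch of one common way such a mask could be used, assuming a per-pixel blend between the input frame and a generated frame. The function name `blend_with_attention`, the grayscale list-of-rows image layout, and the blending rule itself are illustrative assumptions, not the authors' implementation.

```python
from typing import List

# Grayscale image as rows of pixel intensities in [0, 1] (an assumed layout).
Image = List[List[float]]


def blend_with_attention(source: Image, generated: Image, mask: Image) -> Image:
    """Composite two images pixel by pixel using an attention mask.

    Assumed rule (not taken from the paper): a mask value of 1.0 takes the
    generated pixel, 0.0 keeps the source pixel, and values in between mix
    the two linearly.
    """
    if not (len(source) == len(generated) == len(mask)):
        raise ValueError("source, generated, and mask must have the same height")
    blended = []
    for src_row, gen_row, mask_row in zip(source, generated, mask):
        if not (len(src_row) == len(gen_row) == len(mask_row)):
            raise ValueError("rows must have the same width")
        blended.append([
            m * g + (1.0 - m) * s
            for s, g, m in zip(src_row, gen_row, mask_row)
        ])
    return blended
```

Under this assumption, regions where the mask is near zero (for example, background or hair) would pass through from the input individual unchanged, while regions with a high mask value would take on the newly synthesized expression.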

updated: Fri Apr 16 2021 09:44:12 GMT+0000 (UTC)

published: Fri Apr 16 2021 09:44:12 GMT+0000 (UTC)

arXiv
