High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Chao Xu; Junwei Zhu; Jiangning Zhang; Yue Han; Wenqing Chu; Ying Tai; Chengjie Wang; Zhifeng Xie; Yong Liu

Recently, emotional talking face generation has received considerable attention. However, existing methods adopt only one-hot coding, image, or audio as emotion conditions, so they lack flexible control in practical applications and fail to handle unseen emotion styles because of limited semantics. They also ignore either the one-shot setting or the quality of the generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modalities into a unified space, which inherits a rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning lets our method support an arbitrary emotion modality during testing and generalize to unseen emotion styles. In addition, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to a structural representation. A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and its effectiveness in high-quality face synthesis.
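
The idea of a unified emotion space can be illustrated with a toy sketch. This is not the paper's Aligned Multi-modal Emotion encoder: the fixed projection matrices below stand in for learned encoders, and every name (`PROJECTIONS`, `embed_emotion`, `nearest_style`, `ANCHORS`) and every number is a hypothetical chosen for illustration. It only shows why a shared, normalized space lets a downstream module accept any modality as the emotion condition.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def project(features, weights):
    """Apply a linear map (one row per output dimension), then normalize."""
    return l2_normalize(
        [sum(w * f for w, f in zip(row, features)) for row in weights]
    )


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# ASSUMPTION: toy projection heads standing in for learned per-modality
# encoders. Each maps modality-specific features into the same 2-d space.
PROJECTIONS = {
    "text": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "image": [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    "audio": [[2.0, 0.0], [0.0, 2.0]],
}


def embed_emotion(modality, features):
    """Embed features from any supported modality into the shared space."""
    return project(features, PROJECTIONS[modality])


def nearest_style(query, anchors):
    """Return the anchor name whose embedding is most similar to the query."""
    return max(anchors, key=lambda name: cosine(query, anchors[name]))


# ASSUMPTION: emotion anchors defined from made-up text features.
ANCHORS = {
    "happy": embed_emotion("text", [1.0, 0.0, 0.0]),
    "sad": embed_emotion("text", [0.0, 1.0, 0.0]),
}

# Queries from other modalities are compared in the same space.
audio_query = embed_emotion("audio", [0.9, 0.1])
image_query = embed_emotion("image", [0.1, 0.1, 0.8, 0.9])
audio_style = nearest_style(audio_query, ANCHORS)
image_style = nearest_style(image_query, ANCHORS)
```

Because every modality lands in the same space, the code that consumes the embedding never needs to know which modality produced it. In the paper this role is played by learned encoders that inherit a semantic prior from CLIP, which the sketch does not attempt to reproduce.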

updated: Thu May 04 2023 05:59:34 GMT+0000 (UTC)

published: Thu May 04 2023 05:59:34 GMT+0000 (UTC)

arXiv
