Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks

Jianrong Wang; Yaxin Zhao; Li Liu; Tianyi Xu; Qi Li; Sen Li

Given an audio clip and a reference face image, the goal of talking head generation is to produce a high-fidelity talking head video. Although some audio-driven methods for generating talking head videos have made progress, most of them focus only on lip-audio synchronization and cannot reproduce the facial expressions of the target person. To address this, we propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an Attention-Augmented Translator based on U-net (AATU). First, MSEF extracts implicit emotional auxiliary features from audio to estimate more accurate emotional face landmarks. Second, AATU acts as a translator between the estimated landmarks and the photo-realistic video frames. Extensive qualitative and quantitative experiments show the superiority of the proposed method over previous works. Code will be made publicly available.

updated: Tue Jun 06 2023 11:31:29 GMT+0000 (UTC)

published: Tue Jun 06 2023 11:31:29 GMT+0000 (UTC)

arXiv
