FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

Kazi Injamamul Haque; Zerrin Yumak

This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion and hesitation). It is also robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, expressivity, person-specific information and generalizability. We employ the self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information from the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and speaker identity helps distinguish even the subtlest facial motions. We carried out extensive objective and subjective evaluations against ground truth and state-of-the-art work. A perceptual user study shows that our approach produces more realistic animation than the state of the art 78% of the time. In addition, our method is 4 times faster because it eliminates the use of complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code via a GitHub repository link.

updated: Thu Mar 09 2023 17:05:19 GMT+0000 (UTC)

published: Thu Mar 09 2023 17:05:19 GMT+0000 (UTC)

arXiv
