StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

Jun Ling; Xu Tan; Liyang Chen; Runnan Li; Yuchao Zhang; Sheng Zhao; Li Song

While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of synthesized videos, they pay less attention to lip motion jitter, which greatly undermines the realism of talking face videos. What causes motion jitter, and how can it be mitigated? In this paper, we systematically analyze the motion jitter problem using a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and we improve motion stability with a series of effective designs. We find that several issues can lead to jitter in synthesized talking face video: 1) jitter in the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitter in the input; 2) augmented erosions on the input data of the neural renderer during training, which simulate the distortion seen at inference and reduce the mismatch; 3) an audio-fused transformer generator that models dependency among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitter in talking face video, we devise an objective metric, the Motion Stability Index (MSI), which quantifies motion jitter as the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method on motion-stable talking face video generation, with better quality than previous systems.
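As a rough illustration of two ideas named in the abstract, the sketch below smooths a per-frame trajectory with a Gaussian kernel and scores it with a reciprocal-of-acceleration-variance measure. It is not the authors' implementation: the abstract does not give formulas, so the fixed (non-adaptive) kernel, the second-order finite difference used for acceleration, the averaging over dimensions, and the names `gaussian_smooth`, `motion_stability_index`, `sigma`, `radius`, and `eps` are all assumptions made for this example.

```python
import numpy as np


def gaussian_smooth(x, sigma=1.0, radius=3):
    """Smooth a (T, D) trajectory along time with a fixed Gaussian kernel.

    Assumption: a fixed-width kernel stands in for the paper's adaptive module.
    """
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so constant signals are preserved
    # Repeat the edge frames so the output keeps all T frames.
    padded = np.pad(x, ((radius, radius), (0, 0)), mode="edge")
    columns = [
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(x.shape[1])
    ]
    return np.stack(columns, axis=1)


def motion_stability_index(x, eps=1e-8):
    """Reciprocal of the variance of acceleration, averaged over D dimensions.

    Assumption: acceleration is the second-order finite difference over frames,
    and eps guards against division by zero for perfectly steady motion.
    """
    accel = x[2:] - 2.0 * x[1:-1] + x[:-2]
    variance = accel.var(axis=0)
    return float(np.mean(1.0 / (variance + eps)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 4.0 * np.pi, 200)
    # Hypothetical 2-D "landmark" track: smooth motion plus per-frame noise.
    clean = np.stack([np.sin(t), np.cos(t)], axis=1)
    jittery = clean + rng.normal(scale=0.05, size=clean.shape)
    smoothed = gaussian_smooth(jittery)
    print("MSI jittery :", motion_stability_index(jittery))
    print("MSI smoothed:", motion_stability_index(smoothed))
```

Under these assumptions a higher score means steadier motion, so the smoothed track should score above the jittery one. A larger `sigma` suppresses more jitter but also blurs fast, genuine lip movements, which is presumably the trade-off an adaptive smoothing module is meant to manage.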

updated: Mon Aug 29 2022 16:56:35 GMT+0000 (UTC)

published: Mon Aug 29 2022 16:56:35 GMT+0000 (UTC)

arXiv
