High-Speed and High-Quality Text-to-Lip Generation

Jinglin Liu; Zhiying Zhu; Yi Ren; Zhou Zhao

As a key component of talking face generation, lip movement generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation, while text-to-lip (T2L) generation has received little attention. T2L is a challenging task, and existing end-to-end works depend on the attention mechanism and autoregressive (AR) decoding. However, AR decoding generates the current lip frame conditioned on previously generated frames, which inherently limits inference speed and also degrades the quality of the generated lip frames through error propagation. This motivates research on parallel T2L generation. In this work, we propose a novel parallel decoding model for high-speed and high-quality text-to-lip generation (HH-T2L). Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features and their durations in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index (SSIM) loss and adversarial learning to improve the perceptual quality of the generated lip frames and alleviate the blurry prediction problem. Extensive experiments conducted on the GRID and TCD-TIMIT datasets show that HH-T2L 1) generates lip movements with quality competitive with the state-of-the-art AR T2L model DualLip and exceeds the baseline AR model TransformerT2L by a notable margin, benefiting from the mitigation of the error propagation problem; and 2) exhibits a clear advantage in inference speed (an average speedup of 19× over DualLip on TCD-TIMIT).
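To make the duration-based, non-autoregressive conditioning concrete, here is a minimal sketch of one common way to use predicted durations: repeating each encoded token feature for as many frames as its duration, so that every output frame has its own conditioning vector and all frames can be decoded in parallel. The abstract does not say how HH-T2L applies the durations, so this expansion scheme, the function name `expand_by_duration`, and the toy values are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def expand_by_duration(features: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token feature durations[i] times along the time axis.

    Hypothetical helper: an assumed duration-expansion step, not taken
    from the paper.
    """
    if features.shape[0] != durations.shape[0]:
        raise ValueError("one duration per token is required")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(features, durations, axis=0)


# Made-up toy input: 3 tokens with 2-dimensional encodings.
token_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
# Made-up durations in frames per token; a token may span zero frames.
predicted_durations = np.array([2, 0, 3])

frame_conditions = expand_by_duration(token_features, predicted_durations)
print(frame_conditions.shape)  # (5, 2): one conditioning vector per lip frame
```

Because the number of output frames is fixed by the durations before decoding starts, no frame has to wait for a previously generated frame, which is the property the abstract credits for the speedup and for avoiding error propagation.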

updated: Wed Jul 14 2021 16:44:04 GMT+0000 (UTC)

published: Wed Jul 14 2021 16:44:04 GMT+0000 (UTC)

arXiv
