Towards Realistic Visual Dubbing with Heterogeneous Sources

Tianyi Xie; Liucheng Liao; Cheng Bi; Benlai Tang; Xiang Yin; Jianfei Yang; Mingjie Wang; Jiali Yao; Yang Zhang; Zejun Ma

Few-shot visual dubbing aims to synchronize lip movements with arbitrary speech input for any talking-head video. Despite moderate improvements, current approaches commonly require high-quality homologous video and audio data, and therefore fail to make sufficient use of heterogeneous data. In practice, perfectly homologous data can be hard to collect, for example when a video has corrupted audio or blurry frames. To exploit this kind of data and support high-fidelity few-shot visual dubbing, we propose a simple yet efficient two-stage framework that offers greater flexibility in mining heterogeneous data. Specifically, the two-stage paradigm employs facial landmarks as an intermediate prior of latent representations and disentangles lip-movement prediction from the core task of realistic talking-head generation. As a result, the two sub-networks can be trained independently, each on its own corpus of more readily available heterogeneous data. The disentanglement also allows further fine-tuning for a given talking head, which better preserves speaker identity in the final synthesized results. Moreover, the proposed method can transfer appearance features from other people to the target speaker. Extensive experiments demonstrate that the proposed method outperforms the state of the art in generating highly realistic videos synchronized with the speech.
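The two-stage split described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class `TwoStageDubber`, the two toy stage functions, and the two-point "landmark" representation are all hypothetical stand-ins chosen to show how landmarks decouple the audio side from the image side.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Landmarks = List[Tuple[float, float]]  # normalized (x, y) points, hypothetical format
Frame = List[List[float]]              # toy grayscale image as nested lists


@dataclass
class TwoStageDubber:
    """Hypothetical wrapper: the stages only meet at the landmark interface."""
    # Stage 1 needs only (audio, landmark) pairs, so blurry video is still usable.
    audio_to_landmarks: Callable[[Sequence[float]], Landmarks]
    # Stage 2 needs only (landmark, frame) pairs, so corrupted audio does not matter.
    landmarks_to_frame: Callable[[Landmarks, Frame], Frame]

    def dub(self, audio_windows: Sequence[Sequence[float]], reference: Frame) -> List[Frame]:
        """Produce one output frame per audio window."""
        return [
            self.landmarks_to_frame(self.audio_to_landmarks(window), reference)
            for window in audio_windows
        ]


def toy_audio_to_landmarks(window: Sequence[float]) -> Landmarks:
    """Toy stand-in for stage 1: mouth opening grows with mean absolute amplitude."""
    energy = sum(abs(x) for x in window) / max(len(window), 1)
    return [(0.5, 0.5 - energy / 2), (0.5, 0.5 + energy / 2)]  # upper and lower lip


def toy_landmarks_to_frame(landmarks: Landmarks, reference: Frame) -> Frame:
    """Toy stand-in for stage 2: copy the reference frame and mark each landmark."""
    frame = [row[:] for row in reference]
    height, width = len(frame), len(frame[0])
    for x, y in landmarks:
        col = min(width - 1, max(0, int(x * width)))
        row = min(height - 1, max(0, int(y * height)))
        frame[row][col] = 1.0
    return frame


dubber = TwoStageDubber(toy_audio_to_landmarks, toy_landmarks_to_frame)
reference_frame = [[0.0] * 4 for _ in range(4)]
frames = dubber.dub([[0.0, 0.0], [0.5, -0.5]], reference_frame)
```

Because each stage sees only its own input and output types, either callable could in principle be retrained or fine-tuned on a separate corpus without touching the other, which is the property the abstract attributes to the disentanglement.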

updated: Mon Jan 17 2022 07:57:24 GMT+0000 (UTC)

published: Mon Jan 17 2022 07:57:24 GMT+0000 (UTC)

arXiv
