Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary

Sibo Zhang; Jiahong Yuan; Miao Liao; Liangjun Zhang

With advances in deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) it needs only a fraction of the training data used by an audio-driven approach; 2) it is more flexible and is not vulnerable to speaker variation; 3) it significantly reduces preprocessing, training, and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
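
To make the phrase "interpolated phoneme poses" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a hypothetical dictionary that maps each phoneme to a single key pose (here a made-up two-value vector) and assumes plain linear interpolation between consecutive key poses; the names `PHONEME_POSES` and `interpolate_poses`, the pose values, and the timing format are all illustrative assumptions not taken from the abstract.

```python
from bisect import bisect_right

# Hypothetical phoneme-pose dictionary: phoneme label -> key pose.
# Real poses would be facial landmark sets; two made-up numbers stand in here.
PHONEME_POSES = {
    "HH": [0.0, 0.0],
    "AY": [1.0, 0.5],
}


def interpolate_poses(timed_phonemes, fps):
    """Return one pose per video frame.

    timed_phonemes: list of (phoneme, time_in_seconds) with strictly
    increasing times and at least two entries (an assumed input format).
    Poses between two key times are linearly interpolated.
    """
    times = [t for _, t in timed_phonemes]
    keys = [PHONEME_POSES[p] for p, _ in timed_phonemes]
    n_frames = int(round((times[-1] - times[0]) * fps)) + 1

    frames = []
    for i in range(n_frames):
        t = times[0] + i / fps
        # Index of the key pose that follows t, clamped to a valid segment.
        j = max(1, min(bisect_right(times, t), len(times) - 1))
        t0, t1 = times[j - 1], times[j]
        w = min(max((t - t0) / (t1 - t0), 0.0), 1.0)  # blend weight in [0, 1]
        frames.append([(1 - w) * a + w * b for a, b in zip(keys[j - 1], keys[j])])
    return frames


frames = interpolate_poses([("HH", 0.0), ("AY", 1.0)], fps=4)
print(len(frames), frames[2])
```

In the pipeline the abstract describes, a sequence of interpolated poses like this would be the input from which the trained GAN generates video frames; the sketch covers only the interpolation step.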

updated: Sat Jan 22 2022 05:06:54 GMT+0000 (UTC)

published: Thu Apr 29 2021 19:54:41 GMT+0000 (UTC)

arXiv
