StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Dongchan Min; Minyoung Song; Sung Ju Hwang

We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that we can independently manipulate the motions and the lip movements while preserving the identity; and 3) an auto-regressive prior augmented with normalizing flow to learn a complex, multi-modal audio-to-motion latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way, when another motion source video is given, but also in a completely audio-driven manner, by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos of impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
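
The abstract does not spell out the objective behind the contrastive lip-sync discriminator. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style loss: an audio embedding should score higher against its matching lip embedding than against mismatched ones. The function names, the cosine-similarity scoring, and the temperature value are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio_emb, lip_embs, positive_index, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    audio_emb:      embedding of one audio window
    lip_embs:       candidate lip-region embeddings; exactly one is in sync
    positive_index: position of the in-sync candidate in lip_embs
    """
    logits = [cosine(audio_emb, lip) / temperature for lip in lip_embs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of the in-sync candidate under a softmax.
    return log_sum - logits[positive_index]


# Toy embeddings: the first candidate is aligned with the audio.
audio = [1.0, 0.0]
lips = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_synced = info_nce(audio, lips, positive_index=0)
loss_unsynced = info_nce(audio, lips, positive_index=2)
```

In this toy setup the loss is close to zero when the labelled positive really is the best-matching candidate and grows large when it is the worst match, which is the behavior a contrastive sync objective relies on.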

updated: Tue Aug 23 2022 12:49:01 GMT+0000 (UTC)

published: Tue Aug 23 2022 12:49:01 GMT+0000 (UTC)

arXiv
