StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Yifeng Ma; Suzhen Wang; Zhipeng Hu; Changjie Fan; Tangjie Lv; Yu Ding; Zhidong Deng; Xin Yu

Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
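The abstract states only that the style code adjusts the weights of the feed-forward layers; it does not say how. The sketch below illustrates one plausible reading, in which the style code produces softmax mixing coefficients over several candidate weight sets. It is an assumption for illustration, not the authors' implementation: the class name `StyleAdaptiveFeedForward`, the `num_experts` parameter, the gating matrix, and all dimensions are hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mix(mats, coeffs):
    # Weighted sum of equally shaped matrices.
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(c * M[r][k] for c, M in zip(coeffs, mats))
             for k in range(cols)] for r in range(rows)]


class StyleAdaptiveFeedForward:
    """Hypothetical feed-forward layer whose weights depend on a style code."""

    def __init__(self, d_model, d_hidden, d_style, num_experts=4, seed=0):
        rng = random.Random(seed)

        def rand(r, c):
            return [[rng.gauss(0.0, 0.1) for _ in range(c)] for _ in range(r)]

        # Assumed gating matrix: maps the style code to one score per weight set.
        self.gate = rand(num_experts, d_style)
        # Candidate weight sets for the two linear maps of the feed-forward layer.
        self.w1 = [rand(d_hidden, d_model) for _ in range(num_experts)]
        self.w2 = [rand(d_model, d_hidden) for _ in range(num_experts)]

    def mixing_weights(self, style_code):
        # Coefficients are positive and sum to 1.
        return softmax(matvec(self.gate, style_code))

    def forward(self, x, style_code):
        a = self.mixing_weights(style_code)
        w1 = mix(self.w1, a)  # style-dependent first projection
        w2 = mix(self.w2, a)  # style-dependent second projection
        hidden = [max(0.0, v) for v in matvec(w1, x)]  # ReLU
        return matvec(w2, hidden)


ffn = StyleAdaptiveFeedForward(d_model=4, d_hidden=8, d_style=3)
x = [0.5, -1.0, 0.25, 2.0]          # one token feature (made-up values)
style_a = [1.0, 0.0, 0.0]           # two made-up style codes
style_b = [0.0, 0.0, 1.0]
out_a = ffn.forward(x, style_a)
out_b = ffn.forward(x, style_b)
```

In this sketch the same input feature passes through different effective weights depending on the style code, which is the general idea the abstract describes; a real model would learn the gating and weight sets jointly with the rest of the decoder.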

updated: Sat Jun 10 2023 14:37:49 GMT+0000 (UTC)

published: Tue Jan 03 2023 13:16:24 GMT+0000 (UTC)

arXiv
