TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles

Yifeng Ma; Suzhen Wang; Yu Ding; Bowen Ma; Tangjie Lv; Changjie Fan; Zhipeng Hu; Zhidong Deng; Xin Yu

To generate talking head videos with a specified facial expression, previous audio-driven one-shot talking head methods require a reference video with a matching speaking style (i.e., facial expressions). However, finding a video with the desired style may not be easy, which potentially restricts their application. In this work, we propose an expression-controllable one-shot talking head method, dubbed TalkCLIP, in which the expression of the speech is specified in natural language. This significantly eases the difficulty of searching for a video with a desired speaking style. We first construct a text-video paired talking head dataset in which each video has alternative prompt-like descriptions. Specifically, our descriptions comprise coarse-level emotion annotations and fine-grained annotations based on facial action units (AUs). We then introduce a CLIP-based style encoder that first projects natural language descriptions into the CLIP text embedding space and then aligns the text embeddings with the representations of speaking styles. Because CLIP encodes extensive textual knowledge, our method can even generalize to infer a speaking style whose description was not seen during training. Extensive experiments demonstrate that our method can generate photo-realistic talking heads with vivid facial expressions guided by text descriptions.
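As a rough illustration of what pairing a video with alternative prompt-like descriptions could look like, the sketch below composes text from a coarse emotion label and a set of active AUs. It is not the authors' annotation pipeline: the function name `describe_style`, the sentence templates, and the choice of AUs are all assumptions made for this example. Only the AU names are taken from the standard Facial Action Coding System.

```python
# Hypothetical sketch: build prompt-like style descriptions for one video
# from a coarse emotion label plus fine-grained facial action units (AUs).
# The templates and function name are illustrative assumptions, not the
# paper's actual annotation scheme.

# Standard Facial Action Coding System names for a few common AUs.
AU_NAMES = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",
    15: "lip corner depressor",
    26: "jaw drop",
}

# Assumed coarse-level templates; each yields one alternative description.
COARSE_TEMPLATES = [
    "The speaker looks {emotion} while talking.",
    "Talking in a visibly {emotion} manner.",
]


def describe_style(emotion, active_aus):
    """Return a list of alternative text descriptions for one video."""
    descriptions = [t.format(emotion=emotion) for t in COARSE_TEMPLATES]
    # Fine-grained description: list known AUs in ascending numeric order.
    au_phrases = [AU_NAMES[au] for au in sorted(active_aus) if au in AU_NAMES]
    if au_phrases:
        descriptions.append(
            "Active facial action units: " + ", ".join(au_phrases) + "."
        )
    return descriptions


if __name__ == "__main__":
    for line in describe_style("happy", [12, 6]):
        print(line)
```

In a setup like the one the abstract describes, strings of this kind would be passed to a CLIP text encoder and the resulting embeddings aligned with speaking-style representations; that step is omitted here because the abstract does not specify how it is done.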

updated: Sat Apr 01 2023 15:10:02 GMT+0000 (UTC)

published: Sat Apr 01 2023 15:10:02 GMT+0000 (UTC)

arXiv
