GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

Tenglong Ao; Zeyi Zhang; Libin Liu

The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive Language-Image Pre-training (CLIP) model and present a novel CLIP-guided mechanism that extracts efficient style representations from multiple input modalities, such as a piece of text, an example motion clip, or a video. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator via an adaptive instance normalization (AdaIN) layer. We further devise a gesture-transcript alignment mechanism, based on contrastive learning, that ensures semantically correct gesture generation. Our system can also be extended to allow fine-grained style control of individual body parts. We demonstrate an extensive set of examples showing the flexibility and generalizability of our model to a variety of style descriptions. In a user study, we show that our system outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness.
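The AdaIN-based style injection mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the single affine map from style embedding to scale and shift, and the toy dimensions are all assumptions made for illustration, and the `style_embedding` list merely stands in for a CLIP embedding (no CLIP model is called). The sketch normalizes each feature channel over time and then re-scales and re-shifts it with values predicted from the style vector.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel over the time axis to zero mean, unit variance.

    `features` is a list of T frames, each a list of C channel values.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    normalized = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / num_frames
        var = sum((x - mean) ** 2 for x in column) / num_frames
        inv_std = 1.0 / math.sqrt(var + eps)
        for t in range(num_frames):
            normalized[t][c] = (column[t] - mean) * inv_std
    return normalized


def linear(weights, bias, vector):
    """Affine map: `weights` is (out_dim x in_dim), `bias` has out_dim entries."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def adain(features, style_embedding, scale_params, shift_params):
    """Hypothetical AdaIN layer: the style embedding sets per-channel scale/shift.

    `scale_params` and `shift_params` are (weights, bias) tuples; in a trained
    model they would be learned, here they are supplied by the caller.
    """
    gamma = linear(*scale_params, style_embedding)  # per-channel scale
    beta = linear(*shift_params, style_embedding)   # per-channel shift
    normalized = instance_norm(features)
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in normalized]


if __name__ == "__main__":
    # Toy input: 4 frames, 2 channels; a 2-dimensional stand-in style vector.
    toy_features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
    toy_style = [1.0, 0.0]
    toy_scale = ([[2.0, 0.0], [3.0, 0.0]], [0.0, 0.0])
    toy_shift = ([[0.0, 0.0], [0.0, 0.0]], [5.0, -1.0])
    for frame in adain(toy_features, toy_style, toy_scale, toy_shift):
        print(frame)
```

After the layer, each channel's statistics over time are determined by the style vector rather than by the input, which is why this kind of layer is a common way to condition a generator on style.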

updated: Sun Mar 26 2023 03:35:46 GMT+0000 (UTC)

published: Sun Mar 26 2023 03:35:46 GMT+0000 (UTC)

arXiv
