Audio-Driven Co-Speech Gesture Video Generation

Xian Liu; Qianyi Wu; Hang Zhou; Yuanqi Du; Wayne Wu; Dahua Lin; Ziwei Liu

Co-speech gestures are crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study the challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate a speaker image sequence driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from the implicit motion representation into codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. A demo video and more resources are available at: https://alvinliu0.github.io/projects/ANGIE
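
To make the codebook idea concrete, the following is a minimal sketch of generic vector quantization: each continuous feature vector is replaced by its nearest codebook entry, so a sequence can be summarized as a list of discrete indices. It is not the ANGIE implementation. The function names, the 2-D toy features, and the three-entry codebook are all hypothetical, and a real system would presumably learn the codebook from data rather than fix it by hand.

```python
# Illustrative only: generic nearest-neighbour vector quantization,
# not the VQ-Motion Extractor described in the abstract.
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to the index and value of its nearest code."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feature in features:
        # Pick the codebook entry with the smallest distance to this feature.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical toy data: three "motion pattern" codes and three noisy features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)
print(quantized)
```

In this toy setting the discrete index sequence is the kind of token stream that an autoregressive model could be trained to predict, while the gap between each feature and its assigned code is the fine detail that quantization discards.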

updated: Mon Dec 05 2022 15:28:22 GMT+0000 (UTC)

published: Mon Dec 05 2022 15:28:22 GMT+0000 (UTC)

arXiv
