Audio-driven Neural Gesture Reenactment with Video Motion Graphs

Yang Zhou; Jimei Yang; Dingzeyu Li; Jun Saito; Deepali Aneja; Evangelos Kalogerakis

Human speech is often accompanied by body gestures, including arm and hand gestures. We present a method that reenacts a high-quality video with gestures matching a target speech audio. The key idea of our method is to split and re-assemble clips from a reference video through a novel video motion graph that encodes valid transitions between clips. To seamlessly connect different clips in the reenactment, we propose a pose-aware video blending network that synthesizes video frames around the stitched frames between two clips. Moreover, we develop an audio-based gesture searching algorithm to find the optimal order of the reenacted frames. Our system generates reenactments that are consistent with both the audio rhythms and the speech content. We evaluate our synthesized video quality quantitatively, qualitatively, and with user studies, demonstrating that our method produces videos of much higher quality and consistency with the target audio compared to previous work and baselines.
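
To make the motion-graph idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes each reference frame has a flattened pose vector and a scalar "motion energy", and that the target audio is reduced to a per-step onset strength. The names `pose_distance`, `build_motion_graph`, `best_path`, the distance threshold, and the absolute-difference cost are all hypothetical choices made for illustration; the abstract does not specify the actual transition rule or search cost, and the pose-aware blending network is omitted entirely.

```python
import math


def pose_distance(a, b):
    """Euclidean distance between two flattened pose vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_motion_graph(poses, threshold):
    """Map each frame index to the frames allowed to follow it.

    Natural edges link frame i to i + 1. A synthetic edge i -> j is added
    when the two poses are within `threshold` (an assumed similarity rule).
    """
    n = len(poses)
    graph = {i: [] for i in range(n)}
    for i in range(n - 1):
        graph[i].append(i + 1)
    for i in range(n):
        for j in range(n):
            if j in (i, i + 1):
                continue  # skip self-loops and existing natural edges
            if pose_distance(poses[i], poses[j]) <= threshold:
                graph[i].append(j)
    return graph


def best_path(graph, motion_energy, audio_onsets, start=0):
    """Dynamic-programming search over the graph.

    Finds the frame sequence (one frame per audio step) that minimizes the
    summed |motion_energy - onset strength|, an assumed stand-in cost.
    """
    n = len(motion_energy)
    inf = float("inf")
    cost = [inf] * n
    cost[start] = abs(motion_energy[start] - audio_onsets[0])
    back = []  # back[t - 1][j] = predecessor of frame j at step t
    for t in range(1, len(audio_onsets)):
        new_cost = [inf] * n
        prev = [None] * n
        for i in range(n):
            if cost[i] == inf:
                continue
            for j in graph[i]:
                c = cost[i] + abs(motion_energy[j] - audio_onsets[t])
                if c < new_cost[j]:
                    new_cost[j] = c
                    prev[j] = i
        back.append(prev)
        cost = new_cost
    end = min(range(n), key=lambda k: cost[k])
    if cost[end] == inf:
        raise ValueError("no valid path of the requested length")
    path = [end]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return path, cost[end]


# Toy data: five frames with 1-D poses; frames 0/3 and 1/4 look alike.
poses = [[0.0], [1.0], [2.0], [0.1], [1.1]]
motion_energy = [0, 1, 0, 0, 1]
audio_onsets = [0, 1, 1]

graph = build_motion_graph(poses, threshold=0.2)
path, total_cost = best_path(graph, motion_energy, audio_onsets)
```

In this toy example the search jumps from frame 1 to frame 4 through a synthetic edge because that keeps the motion energy high while the audio onset stays strong. In the method described above, such a jump is where blended frames would be synthesized around the stitch point.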

updated: Sat Jul 23 2022 14:02:57 GMT+0000 (UTC)

published: Sat Jul 23 2022 14:02:57 GMT+0000 (UTC)

arXiv
