Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels

Zipeng Ye; Mengfei Xia; Ran Yi; Juyong Zhang; Yu-Kun Lai; Xuwei Huang; Guoxin Zhang; Yong-jin Liu

In this paper, we present a dynamic convolution kernel (DCK) strategy for convolutional neural networks. Using a fully convolutional network with the proposed DCKs, high-quality talking-face video can be generated from multi-modal sources (i.e., unmatched audio and video) in real time, and our trained model is robust to different identities, head postures, and input audio. The proposed DCKs are specially designed for audio-driven talking-face video generation, leading to a simple yet effective end-to-end system. We also provide a theoretical analysis to interpret why DCKs work. Experimental results show that our method can generate high-quality talking-face video with background at 60 fps. Comparison and evaluation against state-of-the-art methods demonstrate the superiority of our method.

updated: Sun Jan 16 2022 07:07:59 GMT+0000 (UTC)

published: Sun Jan 16 2022 07:07:59 GMT+0000 (UTC)

arXiv
