Speech2Video: Cross-Modal Distillation for Speech to Video Generation

Shijing Si; Jianzong Wang; Xiaoyang Qu; Ning Cheng; Wenqi Wei; Xinghua Zhu; Jing Xiao

This paper investigates a novel task: generating talking face videos from speech alone. Speech-to-video generation can enable interesting applications in the entertainment, customer service, and human-computer interaction industries. Indeed, the timbre, accent, and speed of speech may carry rich information about the speaker's appearance. The main challenge lies in disentangling distinct visual attributes from audio signals. In this article, we propose a lightweight cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs. The extracted features are then integrated by a generative adversarial network into talking face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the proposed framework captures emotional expressions from speech alone and produces spontaneous facial motion in the video output. Compared to the baseline method, in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of emotion expression in the generated videos.
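
The abstract does not specify the distillation objective, so the following is only a minimal sketch of the general idea behind cross-modal distillation, not the authors' method. It assumes a frozen video "teacher" that has already produced an embedding, and a linear audio "student" trained to match that embedding with a mean-squared-error loss. All names (`linear`, `distill_loss`, `distill_step`, `train_demo`) and the toy feature vectors are hypothetical.

```python
def linear(weights, x):
    """Apply a linear map: one output value per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def distill_loss(student_out, teacher_out):
    """Mean squared error between student and teacher embeddings."""
    n = len(teacher_out)
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / n


def distill_step(weights, audio_feat, teacher_emb, lr=0.5):
    """One gradient-descent step on the student; the teacher stays fixed."""
    out = linear(weights, audio_feat)
    n = len(teacher_emb)
    new_weights = []
    for row, o, t in zip(weights, out, teacher_emb):
        # d(loss)/d(w_j) = 2 * (o - t) / n * x_j for this output row
        scale = 2.0 * (o - t) / n
        new_weights.append([w - lr * scale * xi for w, xi in zip(row, audio_feat)])
    return new_weights


def train_demo(steps=50):
    """Fit the student to a single toy (audio, teacher-embedding) pair."""
    audio_feat = [0.5, -0.2, 0.1, 0.8]   # hypothetical audio features
    teacher_emb = [0.3, -0.6, 0.9]       # hypothetical video-teacher embedding
    weights = [[0.0] * len(audio_feat) for _ in teacher_emb]
    losses = []
    for _ in range(steps):
        losses.append(distill_loss(linear(weights, audio_feat), teacher_emb))
        weights = distill_step(weights, audio_feat, teacher_emb)
    losses.append(distill_loss(linear(weights, audio_feat), teacher_emb))
    return losses


if __name__ == "__main__":
    history = train_demo()
    print(f"initial loss: {history[0]:.4f}, final loss: {history[-1]:.2e}")
```

In a real system the student and teacher would be deep networks trained over many video clips, and the paper's separate emotion and identity features would presumably need separate targets; the sketch only shows the student-matches-teacher training signal.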

updated: Sat Jul 10 2021 10:27:26 GMT+0000 (UTC)

published: Sat Jul 10 2021 10:27:26 GMT+0000 (UTC)

arXiv
