Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Dan Bigioi; Shubhajit Basak; Hugh Jordan; Rachel McDonnell; Peter Corcoran

In this paper, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate speech recording, without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, with a word error rate of 45% measured using an off-the-shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.
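For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance between a reference transcript and a predicted transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition only; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration, and the abstract does not specify how the authors or their lip reading model computed the figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): tokenises on whitespace and
    counts substitutions, deletions, and insertions with equal cost.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that substitutes one word and drops another from a four-word reference scores 0.5, so a rate of 45% corresponds to roughly nine word-level errors per twenty reference words.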

updated: Tue Jan 10 2023 12:01:20 GMT+0000 (UTC)

published: Tue Jan 10 2023 12:01:20 GMT+0000 (UTC)

arXiv
