Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Dan Bigioi; Shubhajit Basak; Michał Stypułkowski; Maciej Zięba; Hugh Jordan; Rachel McDonnell; Peter Corcoran

Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate speech recording, the method re-synchronizes the lip and jaw motions to the new audio without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronized facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
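
To make the idea of conditioning a denoising diffusion model on audio features more concrete, the following is a minimal sketch of the standard DDPM forward-noising step, with a mel-spectrogram window attached as a condition. It is not the authors' implementation. The linear noise schedule, the frame and mel shapes, and the helper names `make_alpha_bar`, `noise_frame`, and `build_denoiser_inputs` are all assumptions made for illustration, and the denoising network itself is omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_frame(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0); also return the noise that was added."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def build_denoiser_inputs(x_t, mel_window):
    """Pair the noisy frame with a flattened mel window as the condition.

    Hypothetical helper: a real denoiser would consume both and predict eps.
    """
    return {"noisy_frame": x_t, "audio_condition": mel_window.reshape(-1)}

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical data: one RGB frame scaled to [-1, 1] and an 80-bin mel window
# covering 16 time steps. Neither shape comes from the paper.
frame = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
mel = rng.standard_normal((80, 16))

x_t, eps = noise_frame(frame, t=500, alpha_bar=alpha_bar, rng=rng)
inputs = build_denoiser_inputs(x_t, mel)
```

In training, a network would take `inputs` together with the timestep and be optimized to predict `eps`. At inference time, the same audio condition would steer each denoising step toward a frame whose mouth region matches the speech.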

updated: Thu May 11 2023 11:56:42 GMT+0000 (UTC)

published: Tue Jan 10 2023 12:01:20 GMT+0000 (UTC)

arXiv
