Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

Kun Su; Kaizhi Qian; Eli Shlizerman; Antonio Torralba; Chuang Gan

Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and the impact locations, which are rarely available in the real world, so these methods cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning-based approaches can capture only a weak correspondence between visual content and impact sounds, since they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we propose to use additional physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters, which are estimated directly from noisy real-world impact sound examples without a sophisticated setup, and learned residual parameters, which interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly.
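
To make the idea of "physics parameters that can represent and synthesize the sound" concrete, the sketch below assumes a standard modal model, in which an impact sound is a sum of exponentially damped sinusoids. The abstract does not specify the parameterization, so the function name `synthesize_impact`, the choice of frequency, amplitude, and decay rate as parameters, and all numeric values are illustrative assumptions rather than the authors' method.

```python
import numpy as np

def synthesize_impact(freqs_hz, amps, decays, sr=16000, duration=0.5):
    """Sum of exponentially damped sinusoids (an assumed modal model).

    freqs_hz: mode frequencies in Hz
    amps:     initial amplitude of each mode
    decays:   damping rate of each mode in 1/s
    """
    t = np.arange(int(sr * duration)) / sr            # time axis in seconds
    f = np.asarray(freqs_hz, dtype=float)[:, None]    # shape (modes, 1)
    a = np.asarray(amps, dtype=float)[:, None]
    d = np.asarray(decays, dtype=float)[:, None]
    modes = a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return modes.sum(axis=0)                          # mix all modes

# Hypothetical parameters, chosen only for illustration.
freqs = [440.0, 1230.0, 2750.0]
amps = [1.0, 0.5, 0.25]
decays = [8.0, 15.0, 30.0]

wave = synthesize_impact(freqs, amps, decays)

# "Editing" the sound: doubling every damping rate shortens the ring.
damped = synthesize_impact(freqs, amps, [2.0 * d for d in decays])
```

Because each parameter has a direct physical meaning, changing one of them (here, the damping) alters the sound in a predictable way, which is the kind of flexible editing an interpretable representation allows.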

updated: Wed Mar 29 2023 17:59:53 GMT+0000 (UTC)

published: Wed Mar 29 2023 17:59:53 GMT+0000 (UTC)

arXiv
