Self-Supervised Visual Acoustic Matching

Arjun Somayazulu; Changan Chen; Kristen Grauman

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate that it outperforms the state of the art on multiple challenging datasets and a wide variety of real-world audio and environments.

updated: Thu Jul 27 2023 17:59:59 GMT+0000 (UTC)

published: Thu Jul 27 2023 17:59:59 GMT+0000 (UTC)

arXiv
