Visual Acoustic Matching

Changan Chen; Ruohan Gao; Paul Calamia; Kristen Grauman

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
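As a rough illustration of what "sounding like it was recorded in a target environment" means in signal terms, the sketch below applies the classical room model: reverberant audio is approximated by convolving a dry signal with a room impulse response (RIR). This is not the paper's cross-modal transformer, which infers the acoustics from an image. The function names `synthetic_rir` and `apply_room`, the 16 kHz sample rate, and the exponentially decaying noise used as the RIR are all assumptions made for this example.

```python
import numpy as np


def synthetic_rir(rt60_s, sample_rate=16000, seed=0):
    """Toy room impulse response (hypothetical helper, not from the paper).

    White noise shaped by an exponential envelope whose amplitude
    falls by 60 dB after rt60_s seconds.
    """
    rng = np.random.default_rng(seed)
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    # A 60 dB amplitude drop is a factor of 10**-3, reached at t = rt60_s.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))


def apply_room(dry, rir):
    """Convolve a dry waveform with an RIR and peak-normalize the result."""
    wet = np.convolve(dry, rir)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


# Example: a short click placed in a room with a 0.5 s reverberation time.
dry = np.zeros(100)
dry[0] = 1.0
wet = apply_room(dry, synthetic_rir(0.5))
```

A longer `rt60_s` produces a longer, more slowly decaying tail, which is the kind of property the task expects a model to infer from the visible geometry and materials of the target room.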

updated: Mon Jun 13 2022 19:08:51 GMT+0000 (UTC)

published: Mon Feb 14 2022 17:05:22 GMT+0000 (UTC)

arXiv
