Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

Xian Liu; Rui Qian; Hang Zhou; Di Hu; Weiyao Lin; Ziwei Liu; Bolei Zhou; Xiaowei Zhou

The task of audio-visual sound source localization has been well studied in constrained scenes where the audio recordings are clean. In real-world scenarios, however, the audio is usually contaminated by off-screen sounds and background noise. This interference disrupts the identification of the desired sources and the building of visual-sound connections, making previous methods inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient because audio signals are additive. We therefore extend the audio representation with our Audio-Instance-Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. We then erase the influence of audible but off-screen sounds and of silent but visible objects with a Cross-modal Referrer module with cross-modality distillation. Quantitative and qualitative evaluations demonstrate that the proposed framework achieves superior results on sound localization tasks, especially in real-world scenarios. Code is available at https://github.com/alvinliu0/Visual-Sound-Localization-in-the-Wild.
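To make the point about additive, unevenly mixed audio concrete, here is a minimal sketch in plain Python. It is not the authors' code and does not implement the Audio-Instance-Identifier module. The sample rate, tone frequencies, gains, and the helper names `tone`, `mix`, and `energy` are all assumptions chosen for illustration. It only shows that a mixture is the sample-wise sum of its sources and that a single global summary of the mixture (here, its energy) is dominated by the louder source.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed value for illustration only


def tone(freq_hz, seconds=0.1, sample_rate=SAMPLE_RATE):
    """Return a unit-amplitude sine tone as a list of float samples."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def mix(sources, gains):
    """Additively mix equal-length sources, scaling each one by its gain."""
    length = len(sources[0])
    return [sum(g * s[i] for s, g in zip(sources, gains)) for i in range(length)]


def energy(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


# Two hypothetical sounding instances: one loud, one quiet.
loud = tone(440.0)
quiet = tone(1000.0)
loud_gain, quiet_gain = 1.0, 0.1  # uneven volumes (assumed)

mixture = mix([loud, quiet], gains=[loud_gain, quiet_gain])

# Share of the mixture's energy contributed by each source.
loud_share = energy([loud_gain * x for x in loud]) / energy(mixture)
quiet_share = energy([quiet_gain * x for x in quiet]) / energy(mixture)

print(f"loud share of energy:  {loud_share:.3f}")
print(f"quiet share of energy: {quiet_share:.3f}")
```

In this toy setting the quiet source contributes only a small fraction of the mixture's energy, which hints at why a single audio representation might overlook it.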

updated: Sun Feb 13 2022 21:06:19 GMT+0000 (UTC)

published: Sun Feb 13 2022 21:06:19 GMT+0000 (UTC)

arXiv
