Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Dennis Fedorishin; Deen Dayal Mohan; Bhavin Jawade; Srirangaraj Setlur; Venu Govindaraju

Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps that capture the correlation between the two modalities in order to localize the source of the sound. In a video, the objects exhibiting movement are often the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the Soundnet Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.

updated: Sun Nov 06 2022 03:48:45 GMT+0000 (UTC)

published: Sun Nov 06 2022 03:48:45 GMT+0000 (UTC)

arXiv
