FlowGrad: Using Motion for Visual Sound Source Localization

Rajsuryan Singh; Pablo Zinemanas; Xavier Serra; Juan Pablo Bello; Magdalena Fuentes

Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner and, by design, excludes the temporal information present in videos. While this approach has proven effective on widely used benchmark datasets, it falls short in challenging scenarios such as urban traffic. This work introduces temporal context into state-of-the-art methods for sound source localization in urban scenes, using optical flow to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.

updated: Tue Nov 15 2022 18:12:10 GMT+0000 (UTC)

published: Tue Nov 15 2022 18:12:10 GMT+0000 (UTC)

arXiv
