Multi-Modulation Network for Audio-Visual Event Localization

Hao Wang; Zheng-Jun Zha; Liang Li; Xuejin Chen; Jiebo Luo

We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting informative correlations between segments of the two modalities and between multi-scale event proposals. We propose a novel Multi-Modulation Network (M2N) to learn these correlations and leverage them as semantic guidance to modulate the related auditory, visual, and fused features. In particular, during feature encoding, we propose cross-modal normalization and intra-modal normalization. The former modulates the features of the two modalities by establishing and exploiting the cross-modal relationship. The latter modulates the features of a single modality with event-relevant semantic guidance from the same modality. In the fusion stage, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments. With the auditory, visual, and fused features modulated by the correlation information regarding audio-visual events, M2N performs accurate event localization. Extensive experiments conducted on the AVE dataset demonstrate that our proposed method outperforms the state of the art in both supervised event localization and cross-modality localization.

updated: Mon Aug 30 2021 13:11:02 GMT+0000 (UTC)

published: Thu Aug 26 2021 13:11:48 GMT+0000 (UTC)

arXiv
