Past and Future Motion Guided Network for Audio Visual Event Localization

Tingxiu Chen; Jianqin Yin; Jin Tang

In recent years, audio-visual event localization has attracted much attention. Its purpose is to detect the segments containing audio-visual events and to recognize the event category in untrimmed videos. Existing methods use audio-guided visual attention to direct the model to the spatial region of the ongoing event; they focus on the correlation between audio and visual information but ignore the correlation between audio and spatial motion. We propose a past and future motion extraction (pf-ME) module, embedded in the past and future motion guided network (PFAGN), to mine visual motion from videos, and a motion guided audio attention (MGAA) module that uses past and future visual motion to focus on the information in the audio modality that is related to the events of interest. We choose AVE as the experimental dataset, and the experiments show that our method outperforms the state of the art in both supervised and weakly-supervised settings.

updated: Sun May 08 2022 07:26:43 GMT+0000 (UTC)

published: Sun May 08 2022 07:26:43 GMT+0000 (UTC)

arXiv
