Where and When: Space-Time Attention for Audio-Visual Explanations

Yanbei Chen; Thomas Hummel; A. Sophia Koepke; Zeynep Akata

Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, how to explain the dynamics of a complex multi-modal model remains underexplored. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model predicts audio-visual video events while justifying its decisions by localizing where the relevant visual cues appear and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets, comparing extensively to multiple recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate the clearly superior performance of our model over existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study of the explainability of our model, based on robustness analysis via perturbation tests and pointing games using human annotations.

updated: Tue May 04 2021 14:16:55 GMT+0000 (UTC)

published: Tue May 04 2021 14:16:55 GMT+0000 (UTC)

arXiv
