Investigating Modality Bias in Audio Visual Video Parsing

Piyush Singh Pasi; Shubham Nemani; Preethi Jyothi; Ganesh Ramakrishnan

We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a bag of labels for each video. An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities, and an attentive pooling module that aggregates predicted audio and visual segment-level event probabilities to yield video-level event probabilities. We provide a detailed analysis of modality bias in the existing HAN architecture, where a modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual and audio-visual events at both segment-level and event-level, in comparison to the existing HAN model.
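The attentive pooling step described above can be illustrated with a toy sketch. This is not the HAN implementation: the function name `attentive_pool`, the nested-list layout `probs[segment][modality][event]`, and the use of a single softmax over all (segment, modality) pairs are assumptions made here for brevity. It shows only how segment-level audio and visual event probabilities could be combined into one video-level probability per event using attention weights.

```python
import math


def attentive_pool(probs, logits):
    """Pool segment-level event probabilities into video-level probabilities.

    probs[t][m][c]  : probability of event c in segment t for modality m
                      (m = 0 audio, m = 1 visual; this ordering is an assumption).
    logits[t][m][c] : unnormalised attention score for the same entry.

    Simplifying assumption: one softmax over all (segment, modality) pairs
    per event, so each output is a convex combination of the inputs.
    """
    num_segments = len(probs)
    num_modalities = len(probs[0])
    num_events = len(probs[0][0])

    video_probs = []
    for c in range(num_events):
        flat_logits = [logits[t][m][c]
                       for t in range(num_segments)
                       for m in range(num_modalities)]
        flat_probs = [probs[t][m][c]
                      for t in range(num_segments)
                      for m in range(num_modalities)]

        # Numerically stable softmax over the flattened attention scores.
        peak = max(flat_logits)
        exps = [math.exp(x - peak) for x in flat_logits]
        norm = sum(exps)

        # Attention-weighted sum of segment-level probabilities.
        video_probs.append(sum(e / norm * p for e, p in zip(exps, flat_probs)))
    return video_probs


# Two segments, two modalities, one event class (made-up numbers).
probs = [[[0.2], [0.4]],
         [[0.6], [0.8]]]

# Uniform attention: the result is the plain average over all four entries.
uniform = [[[0.0], [0.0]],
           [[0.0], [0.0]]]
balanced = attentive_pool(probs, uniform)

# Attention concentrated on audio: the visual probabilities barely contribute.
audio_heavy = [[[50.0], [-50.0]],
               [[50.0], [-50.0]]]
skewed = attentive_pool(probs, audio_heavy)
```

In this toy setup, when the attention scores concentrate on one modality, the pooled output is determined almost entirely by that modality's segment-level probabilities. That is one simple way to picture a modality being ignored during prediction, though the analysis in the paper itself may rest on a different mechanism.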

updated: Thu Mar 31 2022 07:43:01 GMT+0000 (UTC)

published: Thu Mar 31 2022 07:43:01 GMT+0000 (UTC)

arXiv
