Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Yung-Hsuan Lai; Yen-Chun Chen; Yu-Chiang Frank Wang


Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on the modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell which events happen, not the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is proposed to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are immune to noise from the other, potentially unaligned modality. Moreover, our best model achieves a new state of the art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves a new state of the art there as well. Code is available at: https://github.com/Franklin905/VALOR.
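
The label-harvesting idea can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes that each modality teacher has already produced a per-segment, per-event score in [0, 1], that a fixed threshold decides whether an event is present, and that events missing from the weak video-level label are always masked out. The function name, the threshold value, and the toy scores are all hypothetical.

```python
from typing import List


def harvest_modality_labels(
    video_labels: List[int],
    teacher_scores: List[List[float]],
    threshold: float = 0.5,  # assumed value, chosen only for illustration
) -> List[List[int]]:
    """Turn one teacher's scores into segment-level labels for one modality.

    video_labels:   weak multi-hot label for the whole video (1 = event occurs).
    teacher_scores: teacher_scores[t][c] is the teacher's score for event c
                    in segment t, computed from a single modality only.
    An event is kept for a segment only if the weak label allows it and the
    teacher's score reaches the threshold.
    """
    return [
        [int(weak == 1 and score >= threshold)
         for weak, score in zip(video_labels, segment)]
        for segment in teacher_scores
    ]


# Toy video with three candidate events; the weak label says events 0 and 2 occur.
video_labels = [1, 0, 1]

# Hypothetical teacher scores for two segments. Each teacher sees only its
# own modality, so a noisy or unaligned stream cannot affect the other one.
audio_scores = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.3]]
visual_scores = [[0.1, 0.2, 0.7], [0.6, 0.9, 0.8]]

audio_labels = harvest_modality_labels(video_labels, audio_scores)
visual_labels = harvest_modality_labels(video_labels, visual_scores)
```

In this toy case event 0 is audible only in the first segment and visible only in the second, while event 1 is dropped everywhere because the weak label rules it out, even though both teachers score it highly.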

updated: Sat May 27 2023 02:57:39 GMT+0000 (UTC)

published: Sat May 27 2023 02:57:39 GMT+0000 (UTC)

arXiv
