Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion

Baptiste Pouthier; Laurent Pilati; Leela K. Gudupudi; Charles Bouveyron; Frederic Precioso

It is now well established from a variety of studies that combining video and audio data yields a significant benefit in detecting active speakers. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem that leverages the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. The results show that the proposed multi-objective learning architecture outperforms traditional approaches, improving both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves on the state of the art on the AVA-ActiveSpeaker dataset.

updated: Mon Jun 07 2021 17:38:55 GMT+0000 (UTC)

published: Mon Jun 07 2021 17:38:55 GMT+0000 (UTC)

arXiv
