Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

R Gnana Praveen; Eric Granger; Patrick Cardinal

Automatic emotion recognition (ER) has recently gained a lot of interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, allowing it to effectively leverage inter-modal relationships while retaining intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both intra- and inter-modal relationships, thereby significantly improving system performance over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent.
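
The idea of attending from a joint A-V representation back to each individual modality can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract only states that the weights come from a correlation between the joint representation and each modality, so the sketch substitutes a scaled dot-product similarity for that correlation. The function name `joint_cross_attention`, the projection size `k`, the random (untrained) projection matrices, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, k=8, seed=0):
    """Illustrative joint cross-attention (hypothetical, simplified).

    x_a: audio features, shape (L, d_a), one row per time step.
    x_v: visual features, shape (L, d_v), same L time steps.
    Returns the fused features (L, d_a + d_v) and the attention weights.
    """
    rng = np.random.default_rng(seed)
    d = x_a.shape[1] + x_v.shape[1]

    # Joint representation: concatenate both modalities per time step.
    joint = np.concatenate([x_a, x_v], axis=1)            # (L, d)

    # Assumed projections; in a real model these would be learned.
    w_q = rng.standard_normal((d, k)) / np.sqrt(d)
    queries = joint @ w_q                                 # (L, k)

    attended, weights = {}, {}
    for name, x in (("audio", x_a), ("visual", x_v)):
        w_k = rng.standard_normal((x.shape[1], k)) / np.sqrt(x.shape[1])
        keys = x @ w_k                                    # (L, k)
        # Similarity between the joint features and this modality,
        # used here as a stand-in for the correlation in the abstract.
        scores = queries @ keys.T / np.sqrt(k)            # (L, L)
        weights[name] = softmax(scores, axis=-1)
        # Residual connection keeps the original intra-modal features.
        attended[name] = x + weights[name] @ x

    fused = np.concatenate([attended["audio"], attended["visual"]], axis=1)
    return fused, weights


# Usage with random stand-in features: 5 time steps, 4 audio and 6 visual dims.
rng = np.random.default_rng(1)
x_a = rng.standard_normal((5, 4))
x_v = rng.standard_normal((5, 6))
fused, weights = joint_cross_attention(x_a, x_v)
print(fused.shape)  # (5, 10)
```

Because the queries are built from the concatenated features, each modality's attention weights depend on both its own content and the other modality's content, which is one way to read the abstract's claim about using intra- and inter-modal relationships at the same time.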

updated: Mon Sep 19 2022 15:01:55 GMT+0000 (UTC)

published: Mon Sep 19 2022 15:01:55 GMT+0000 (UTC)

arXiv
