Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

R. Gnana Praveen; Eric Granger; Patrick Cardinal

Multimodal analysis has recently attracted considerable interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated unimodal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach that extracts the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. Our new cross-attentional A-V fusion model efficiently leverages the inter-modal relationships. In particular, it computes cross-attention weights to focus on the more contributive features across the individual modalities, and thereby combines the contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) datasets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available: https://github.com/praveena2j/Cross-Attentional-AV-Fusion
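
The idea of weighting one modality's features by their relevance to the other can be illustrated with a small sketch. The code below is a generic scaled cross-attention example in plain Python, not the authors' implementation (the repository linked in the abstract holds that). Every name in it (`cross_attend`, `fuse`, `predict`, `w_av`, `w_va`, `w_out`), the feature sizes, the random weights, the softmax normalization, and the square-root scaling are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, context_feats, w):
    """Let one modality (queries) attend over the other (context).

    query_feats: T_q x d_q, context_feats: T_c x d_c, w: d_q x d_c (assumed learnable).
    Returns the attended context features (T_q x d_c) and the weights (T_q x T_c).
    """
    scores = matmul(matmul(query_feats, w), transpose(context_feats))
    scale = math.sqrt(len(w[0]))  # assumed scaling, as in scaled dot-product attention
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, context_feats), weights


def fuse(audio, visual, w_av, w_va):
    """Concatenate, per time step, the features attended in both directions."""
    visual_for_audio, _ = cross_attend(audio, visual, w_av)  # T x d_v
    audio_for_visual, _ = cross_attend(visual, audio, w_va)  # T x d_a
    return [v + a for v, a in zip(visual_for_audio, audio_for_visual)]


def predict(fused, w_out):
    """Single linear layer mapping fused features to (valence, arousal)."""
    return matmul(fused, w_out)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy example: 4 time steps, 3-dim audio and 5-dim visual features (made-up sizes).
rng = random.Random(0)
audio = random_matrix(rng, 4, 3)
visual = random_matrix(rng, 4, 5)
w_av = random_matrix(rng, 3, 5)
w_va = random_matrix(rng, 5, 3)
w_out = random_matrix(rng, 8, 2)

fused = fuse(audio, visual, w_av, w_va)
predictions = predict(fused, w_out)  # one (valence, arousal) pair per time step
```

In this sketch each row of attention weights sums to one, so every attended feature is a convex combination of the other modality's features. A trained model would learn `w_av`, `w_va`, and `w_out` rather than draw them at random, and would typically use more than one fully connected layer.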

updated: Sat Jul 06 2024 14:47:18 GMT+0000 (UTC)

published: Tue Nov 09 2021 16:01:56 GMT+0000 (UTC)

arXiv
