Leveraging TCN and Transformer for effective visual-audio fusion in continuous emotion recognition

Weiwei Zhou; Jiada Lu; Zhaolong Xiong; Weifeng Wang

Human emotion recognition plays an important role in human-computer interaction. This paper presents our approach to the Valence-Arousal (VA) Estimation, Expression (Expr) Classification, and Action Unit (AU) Detection challenges of the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Specifically, we propose a novel multi-modal fusion model that leverages Temporal Convolutional Networks (TCN) and a Transformer to improve continuous emotion recognition. The model aims to integrate visual and audio information effectively so that emotions are recognized more accurately. It outperforms the baseline and ranks third in the Expression Classification challenge.
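
The abstract names the building blocks (TCN, Transformer, visual-audio fusion) but gives no architectural details. As a rough illustration only, the sketch below shows the generic operation a TCN is built on, a causal dilated 1-D convolution, together with a simple frame-wise concatenation of two feature streams. All function names are hypothetical, and concatenation is an assumed stand-in for fusion; none of this is taken from the paper and it should not be read as the authors' model.

```python
from typing import List, Sequence


def causal_dilated_conv1d(x: Sequence[float],
                          kernel: Sequence[float],
                          dilation: int = 1) -> List[float]:
    """Generic TCN building block (illustrative, not the paper's code).

    Computes y[t] = sum_k kernel[k] * x[t - k * dilation], treating
    samples before the start of the sequence as zero, so y[t] never
    depends on inputs later than t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of a stack with one causal conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


def fuse_by_concatenation(visual: Sequence[Sequence[float]],
                          audio: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assumed fusion scheme: concatenate per-frame feature vectors."""
    if len(visual) != len(audio):
        raise ValueError("streams must have the same number of frames")
    return [list(v) + list(a) for v, a in zip(visual, audio)]


if __name__ == "__main__":
    print(causal_dilated_conv1d([1, 2, 3, 4], [1, 1], dilation=2))
    print(receptive_field(2, [1, 2, 4]))
    print(fuse_by_concatenation([[1, 2], [3, 4]], [[5], [6]]))
```

Stacking such layers with growing dilations widens the temporal context seen at each frame, which is the usual motivation for using a TCN on frame-level features; in a model like the one described, the fused sequence would then typically be passed on to a Transformer encoder.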

updated: Mon Apr 17 2023 11:30:07 GMT+0000 (UTC)

published: Wed Mar 15 2023 04:15:57 GMT+0000 (UTC)

arXiv
