Multi Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling

Jun-Hwa Kim; Namho Kim; Chee Sun Won

Facial expression recognition is an essential task for various applications, including emotion detection, mental health analysis, and human-machine interaction. In this paper, we propose a multi-modal facial expression recognition method that exploits audio information along with facial images, since audio can provide a crucial clue for differentiating ambiguous facial expressions. Specifically, we introduce a Modal Fusion Module (MFM) that fuses audio-visual information, with both image and audio features extracted by a Swin Transformer. Additionally, we tackle the imbalance problem in the dataset by employing dynamic data resampling. Our model has been evaluated in the Affective Behavior Analysis in-the-wild (ABAW) challenge of CVPR 2023.
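The abstract does not say how the dynamic data resampling is carried out, so the following is only a minimal sketch of one common scheme, not the authors' method. It assumes the imbalance is between expression classes and that sampling moves gradually from the original (instance-balanced) distribution to a class-balanced one as training proceeds. The names `sample_weights` and `draw_epoch_indices`, the linear schedule, and the toy labels are all hypothetical.

```python
import random
from collections import Counter


def sample_weights(labels, epoch, total_epochs):
    """Per-sample sampling probabilities for one epoch.

    Assumed schedule (not taken from the paper): linear interpolation
    from instance-balanced sampling at the first epoch to class-balanced
    sampling at the last epoch.
    """
    counts = Counter(labels)
    num_classes = len(counts)
    n = len(labels)
    # t runs from 0.0 (first epoch) to 1.0 (last epoch).
    t = epoch / max(total_epochs - 1, 1)
    weights = []
    for y in labels:
        instance_w = 1.0 / n                       # every sample equally likely
        class_w = 1.0 / (num_classes * counts[y])  # every class equally likely
        weights.append((1.0 - t) * instance_w + t * class_w)
    return weights


def draw_epoch_indices(labels, epoch, total_epochs, rng):
    """Draw len(labels) training indices, with replacement, for one epoch."""
    weights = sample_weights(labels, epoch, total_epochs)
    return rng.choices(range(len(labels)), weights=weights, k=len(labels))


if __name__ == "__main__":
    # Toy, made-up label list: 8 majority-class and 2 minority-class samples.
    toy_labels = ["neutral"] * 8 + ["fear"] * 2
    rng = random.Random(0)
    for epoch in (0, 4):
        picked = draw_epoch_indices(toy_labels, epoch, 5, rng)
        print(epoch, Counter(toy_labels[i] for i in picked))
```

In this sketch the minority class is drawn more often in later epochs, so the model sees the natural distribution early on and a balanced one near the end. Other schedules, or weights recomputed from per-class validation performance, would fit the same interface.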

updated: Sun Mar 19 2023 04:47:43 GMT+0000 (UTC)

published: Wed Mar 15 2023 07:40:28 GMT+0000 (UTC)

arXiv
