Multi-Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling

Jun-Hwa Kim; Namho Kim; Chee Sun Won

Facial expression recognition is important for various purposes, such as emotion detection, mental health analysis, and human-machine interaction. Incorporating audio information along with still images can provide a more comprehensive understanding of an expression state. This paper presents a multi-modal facial expression recognition method for the Affective Behavior Analysis in-the-wild (ABAW) challenge at CVPR 2023. We propose a Modal Fusion Module (MFM) to fuse audio-visual information. The modalities used are image and audio; features are extracted with a Swin Transformer and forwarded to the MFM. Our approach also addresses imbalance in the dataset by resampling the training data, and it leverages the rich modal information in a single frame using dynamic data sampling, leading to improved performance.
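The abstract does not specify how the training data is resampled. As a rough illustration only, the sketch below shows one common way to counter imbalance, assuming the imbalance is across expression classes: draw training indices with replacement, weighting each sample by the inverse of its class frequency. The function names, the toy labels, and the inverse-frequency scheme are all assumptions made for this example and are not taken from the paper.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def resample_indices(labels, num_samples, seed=0):
    """Draw indices with replacement so each class is picked about equally often."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Toy, heavily imbalanced label list (hypothetical data, not from ABAW).
labels = ["neutral"] * 90 + ["happy"] * 8 + ["fear"] * 2

indices = resample_indices(labels, num_samples=3000)
resampled_counts = Counter(labels[i] for i in indices)
print(resampled_counts)
```

With these weights every class carries the same total sampling mass, so the resampled class counts come out roughly equal even though the raw data is dominated by one class. Rare samples are repeated many times, which is why this kind of resampling is usually paired with data augmentation.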

updated: Wed Mar 15 2023 07:40:28 GMT+0000 (UTC)

published: Wed Mar 15 2023 07:40:28 GMT+0000 (UTC)

arXiv
