Transformer-based Multimodal Information Fusion for Facial Expression Analysis

Wei Zhang; Zhimeng Zhang; Feng Qiu; Suzhen Wang; Bowen Ma; Hao Zeng; Rudong An; Yu Ding

Facial expression analysis has been a crucial research problem in computer vision. With the recent development of deep learning techniques and large-scale annotated in-the-wild datasets, facial expression analysis now targets challenges in real-world settings. In this paper, we introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW), which defines four competition tasks: expression classification, action unit detection, valence-arousal estimation, and multi-task learning. The available multimodal information consists of spoken words, speech prosody, and visual expressions in videos. We propose four unified transformer-based network frameworks that fuse this multimodal information. We report preliminary results on the official Aff-Wild2 dataset, which demonstrate the effectiveness of the proposed method.

updated: Wed Mar 23 2022 12:38:50 GMT+0000 (UTC)

published: Wed Mar 23 2022 12:38:50 GMT+0000 (UTC)

arXiv
