POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition

Ce Zheng; Matias Mendieta; Chen Chen

Facial expression recognition (FER) has received increasing interest in the computer vision community. FER is a challenging task in which three key issues are especially prevalent: inter-class similarity, intra-class discrepancy, and scale sensitivity. Existing methods typically address some of these issues but do not tackle all of them in a unified framework. In this paper, we propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that aims to solve these issues holistically. Specifically, we design a transformer-based cross-fusion paradigm that enables effective collaboration between facial landmark features and direct image features, maximizing proper attention to salient facial regions. Furthermore, POSTER employs a pyramid structure to promote scale invariance. Extensive experimental results demonstrate that POSTER outperforms SOTA methods, reaching 92.05% on RAF-DB, 91.62% on FERPlus, 67.31% on AffectNet (7 cls), and 63.34% on AffectNet (8 cls).
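
The two ideas named in the abstract, cross-fusion between a landmark stream and an image stream, and a pyramid over scales, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the equal token counts in both streams, and the average pooling used to build coarser scales are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for (tokens, dim) arrays.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_fusion(img_tokens, lm_tokens):
    # Assumption: each stream keeps its own keys/values but takes its
    # queries from the other stream. Learned projections are replaced by
    # the identity, and both streams must have the same number of tokens
    # so that the residual additions line up.
    img_out = img_tokens + attention(lm_tokens, img_tokens, img_tokens)
    lm_out = lm_tokens + attention(img_tokens, lm_tokens, lm_tokens)
    return img_out, lm_out

def pool_tokens(tokens, factor):
    # Hypothetical downsampling: average each group of `factor` tokens.
    n, d = tokens.shape
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

def pyramid_cross_fusion(img_tokens, lm_tokens, factors=(1, 2, 4)):
    # Run cross-fusion at several token resolutions, then concatenate
    # the token-averaged outputs of both streams from every scale.
    feats = []
    for f in factors:
        img_s, lm_s = cross_fusion(pool_tokens(img_tokens, f),
                                   pool_tokens(lm_tokens, f))
        feats.append(np.concatenate([img_s.mean(axis=0), lm_s.mean(axis=0)]))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))   # 16 image tokens, dimension 8
lm = rng.normal(size=(16, 8))    # 16 landmark tokens, dimension 8
feature = pyramid_cross_fusion(img, lm)
print(feature.shape)
```

In a trained model the fused feature would feed a classifier over the expression classes; here it only shows how the landmark stream can steer attention in the image stream (and vice versa) at more than one scale.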

updated: Fri Apr 08 2022 14:01:41 GMT+0000 (UTC)

published: Fri Apr 08 2022 14:01:41 GMT+0000 (UTC)

arXiv
