FER-former: Multi-modal Transformer for Facial Expression Recognition

Yande Li; Mingjie Wang; Minglun Gong; Yonggang Lu; Li Liu

The ever-increasing demand for intuitive interaction in Virtual Reality has triggered a boom in the field of Facial Expression Recognition (FER). To address the limitations of existing approaches (e.g., narrow receptive fields and homogeneous supervisory signals) and further strengthen the capacity of FER tools, this paper proposes a novel multifarious supervision-steering Transformer for FER in the wild. Referred to as FER-former, our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision. Specifically, to exploit the merits of combining features from prevailing CNNs and Transformers, a hybrid stem is designed to cascade the two types of learning paradigms. Within it, a FER-specific transformer mechanism is devised to characterize conventional hard one-hot label-focusing tokens and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domain-steering supervision module is proposed that gives image features text-space semantic correlations by supervising the similarity between image features and text features. Through the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering strong learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
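
The idea of supervising the similarity between image features and text features can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss, so the example assumes a CLIP-style formulation (cosine similarity between one image feature and one text embedding per expression class, scaled by a temperature and scored with cross-entropy). The function names, the temperature value of 0.07, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarities(image_feat, text_feats):
    """Cosine similarity between one image feature and each class text feature."""
    img = l2_normalize(image_feat)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_feats]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_loss(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over temperature-scaled image-text similarities.

    Assumption: one text embedding per expression class, e.g. produced by a
    frozen text encoder from class-name prompts. The temperature is a
    hypothetical default, not a value taken from the paper.
    """
    sims = cosine_similarities(image_feat, text_feats)
    probs = softmax([s / temperature for s in sims])
    return -math.log(probs[label])


if __name__ == "__main__":
    # Toy 3-class example with hand-made 3-dimensional embeddings.
    text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    image_feat = [0.9, 0.1, 0.0]
    # The loss is smaller when the label matches the closest text embedding.
    print(similarity_loss(image_feat, text_feats, label=0))
    print(similarity_loss(image_feat, text_feats, label=1))
```

In a setup like the one the abstract describes, a term of this kind would presumably be added to the usual one-hot classification loss, so that the image representation is pulled toward the text embedding of its class as well as toward the hard label.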

updated: Thu Mar 23 2023 02:29:53 GMT+0000 (UTC)

published: Thu Mar 23 2023 02:29:53 GMT+0000 (UTC)

arXiv
