AU-Aware Vision Transformers for Biased Facial Expression Recognition

Shuyi Mao; Xinpeng Li; Qingyang Wu; Xiaojiang Peng

Studies have shown that domain bias and label bias exist across Facial Expression Recognition (FER) datasets, making it hard to improve performance on a specific dataset by adding other datasets. On the FER bias issue, recent research mainly focuses on the cross-domain problem using advanced domain adaptation algorithms. This paper addresses a different problem: how to boost FER performance by leveraging cross-domain datasets. Unlike coarse and biased expression labels, facial Action Units (AUs) are fine-grained and objective, as suggested by psychological studies. Motivated by this, we use the AU information of different FER datasets to boost performance and make the following contributions. First, we show experimentally that naive joint training on multiple FER datasets harms FER performance on the individual datasets. We further introduce expression-specific mean images and AU cosine distances to measure FER dataset bias. The conclusions from this new measurement are consistent with the degradation observed in the joint training experiments. Second, we propose a simple yet conceptually new framework, the AU-aware Vision Transformer (AU-ViT). It improves performance on individual datasets by jointly training on auxiliary datasets with AU or pseudo-AU labels. We also find that AU-ViT is robust to real-world occlusions. Moreover, we show for the first time that a carefully initialized ViT achieves performance comparable to advanced deep convolutional networks. Our AU-ViT achieves state-of-the-art performance on three popular datasets: 91.10% on RAF-DB, 65.59% on AffectNet, and 90.15% on FERPlus. The code and models will be released soon.
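
The abstract does not spell out how the AU cosine distance is computed, so the following is only a minimal sketch of one plausible reading: average the AU activation vectors of all images sharing an expression label within each dataset, then take the cosine distance between the two resulting mean vectors. The function names, the three-AU layout, and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(np.dot(u, v) / denom)


def mean_au_vector(au_activations):
    """Average per-image AU activations (rows = images, columns = AUs)."""
    return np.asarray(au_activations, dtype=float).mean(axis=0)


# Hypothetical toy data: AU activations for "happy" images in two datasets.
# Real values would come from AU annotations or an AU detector (assumption).
dataset_a_happy = [[0.9, 0.1, 0.8], [0.7, 0.0, 0.9]]
dataset_b_happy = [[0.2, 0.6, 0.9], [0.1, 0.8, 0.7]]

distance = cosine_distance(
    mean_au_vector(dataset_a_happy),
    mean_au_vector(dataset_b_happy),
)
print(f"AU cosine distance for 'happy': {distance:.3f}")
```

Under this reading, a larger distance for the same expression label would suggest that the two datasets associate that label with different facial muscle patterns, which is one way label bias could show up.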

updated: Sat Nov 12 2022 08:58:54 GMT+0000 (UTC)

published: Sat Nov 12 2022 08:58:54 GMT+0000 (UTC)

arXiv
