Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers

Xiang Zhang; Lijun Yin

Multi-modal learning has intensified in recent years, especially for applications in facial analysis and action unit (AU) detection, yet two main challenges remain: 1) learning relevant features for representation and 2) efficiently fusing multiple modalities. A number of recent works have shown the effectiveness of attention mechanisms for AU detection; however, most of them bind regions of interest (ROIs) to features and rarely apply attention between the features of each AU. Meanwhile, the transformer, which uses a more efficient self-attention mechanism, has been widely used in natural language processing and computer vision tasks but has not been fully explored for AU detection. In this paper, we propose a novel end-to-end Multi-Head Fused Transformer (MFT) method for AU detection, which learns AU encoding feature representations from different modalities with a transformer encoder and fuses the modalities with a separate fusion transformer module. Multi-head fusion attention is designed in the fusion transformer module for effective fusion of multiple modalities. Our approach is evaluated on two public multi-modal AU databases, BP4D and BP4D+, and the results are superior to those of state-of-the-art algorithms and baseline models. We further analyze the performance of AU detection from different modalities.
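
The abstract does not spell out how multi-head fusion attention is computed, so the following is only a generic sketch of one common way to fuse two modalities with multi-head attention: queries are taken from one modality's AU tokens and keys/values from the other's. It is not the authors' MFT implementation. The function name `multi_head_cross_attention`, the token count (12), the feature width (64), the number of heads (4), and the random projection matrices standing in for learned weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(query_tokens, context_tokens, num_heads, rng):
    """Generic cross-modal attention (hypothetical helper, not the paper's code).

    query_tokens:   (n_q, d_model) features from one modality
    context_tokens: (n_c, d_model) features from another modality
    Returns the fused tokens (n_q, d_model) and the attention weights
    (num_heads, n_q, n_c).
    """
    n_q, d_model = query_tokens.shape
    assert context_tokens.shape[1] == d_model
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Assumption: random matrices stand in for learned projection weights.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )

    def split_heads(x):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_tokens @ w_q)
    k = split_heads(context_tokens @ w_k)
    v = split_heads(context_tokens @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n_q, d_head)

    # Concatenate the heads and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((12, 64))  # assumed: 12 AU tokens, modality A
tokens_b = rng.standard_normal((12, 64))  # assumed: 12 AU tokens, modality B
fused, weights = multi_head_cross_attention(tokens_a, tokens_b, num_heads=4, rng=rng)
print(fused.shape, weights.shape)
```

In this sketch each row of `weights` sums to one, so every fused token is a per-head weighted mixture of the other modality's tokens. A trained model would replace the random projections with learned parameters and would typically add residual connections, layer normalization, and a feed-forward sublayer.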

updated: Tue Mar 22 2022 03:31:29 GMT+0000 (UTC)

published: Tue Mar 22 2022 03:31:29 GMT+0000 (UTC)

arXiv
