FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing

Ajian Liu; Zichang Tan; Zitong Yu; Chenxu Zhao; Jun Wan; Yanyan Liang; Zhen Lei; Du Zhang; Stan Z. Li; Guodong Guo

The availability of handy multi-modal (i.e., RGB-D) sensors has brought about a surge of face anti-spoofing research. However, current multi-modal face presentation attack detection (PAD) has two defects: (1) frameworks based on multi-modal fusion require input modalities consistent with those used in training, which seriously limits the deployment scenarios; (2) the performance of ConvNet-based models on high-fidelity datasets is increasingly limited. In this work, we present a pure transformer-based framework for face anti-spoofing, dubbed the Flexible Modal Vision Transformer (FM-ViT), which flexibly targets any single-modal (i.e., RGB) attack scenario with the help of available multi-modal data. Specifically, FM-ViT retains a specific branch for each modality to capture different modal information and introduces the Cross-Modal Transformer Block (CMTB), which consists of two cascaded attentions named Multi-headed Mutual-Attention (MMA) and Fusion-Attention (MFA). MMA guides each modal branch to mine potential features from informative patch tokens, and MFA learns modality-agnostic liveness features by enriching the modal information of the branch's own CLS token. Experiments demonstrate that a single model trained with FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin and approaches the multi-modal frameworks while using fewer FLOPs and model parameters.
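The abstract does not give the equations for MMA or MFA, so the following is only a toy sketch of the general idea of enriching one branch's CLS token with information from another modality's patch tokens. Everything in it is an assumption made for illustration: the function names (`cross_attend`, `enrich_cls`), the single attention head, the omission of learned projections, and the residual addition are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, keys_values):
    """Single-head scaled dot-product attention for one query vector.

    Assumption: learned query/key/value projections are omitted, so the
    token vectors themselves serve as keys and values.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys_values])
    out = [
        sum(w * v[i] for w, v in zip(weights, keys_values))
        for i in range(len(query))
    ]
    return out, weights


def enrich_cls(own_cls, other_patches):
    """Hypothetical fusion step: the branch's CLS token queries the other
    modality's patch tokens, and the attended context is added residually."""
    context, weights = cross_attend(own_cls, other_patches)
    return [c + x for c, x in zip(own_cls, context)], weights


random.seed(0)
dim, num_patches = 8, 4
# Stand-ins for an RGB-branch CLS token and depth-branch patch tokens.
rgb_cls = [random.gauss(0, 1) for _ in range(dim)]
depth_patches = [
    [random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)
]
enriched_cls, attn = enrich_cls(rgb_cls, depth_patches)
print(len(enriched_cls), attn)
```

In this sketch the attention weights indicate which patch tokens of the other modality contribute most to the enriched CLS token. A real implementation would use multiple heads, learned projections, and normalization layers as in a standard vision transformer block.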

updated: Fri May 05 2023 04:28:48 GMT+0000 (UTC)

published: Fri May 05 2023 04:28:48 GMT+0000 (UTC)

arXiv
