Visual Prompt Flexible-Modal Face Anti-Spoofing

Zitong Yu; Rizhao Cai; Yawen Cui; Ajian Liu; Changsheng Chen

Recently, vision-transformer-based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, multimodal face data collected in the real world is often incomplete because modalities from some imaging sensors are missing. Flexible-modal FAS [yu2023flexible] has therefore attracted growing attention. It aims to develop a unified multimodal FAS model that is trained on complete multimodal face data yet remains insensitive to modalities missing at test time. In this paper, we tackle a key challenge in flexible-modal FAS: modalities that go missing during either training or testing in real-world situations. Inspired by the recent success of prompt learning in language models, we propose Visual Prompt flexible-modal FAS (VP-FAS), which learns modality-relevant prompts to adapt a frozen pre-trained foundation model to the downstream flexible-modal FAS task. Specifically, both vanilla visual prompts and residual contextual prompts are plugged into multimodal transformers to handle general missing-modality cases, while requiring fewer than 4% of the learnable parameters needed to train the entire model. Furthermore, a missing-modality regularization is proposed to force the model to learn consistent multimodal feature embeddings when some modalities are missing. Extensive experiments on two multimodal FAS benchmark datasets demonstrate the effectiveness of the VP-FAS framework, which improves performance under various missing-modality cases while reducing the need for heavy model re-training.
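
The abstract describes the mechanism only at a high level. The following is a minimal plain-Python sketch of two of its ideas, under stated assumptions. The first is learnable prompt tokens prepended to the input of a frozen encoder and selected according to which modalities are available. The second is a consistency term between embeddings of complete and partial inputs. The following are all assumptions for illustration and are not taken from the paper: the three modalities (RGB, depth, IR), one prompt set per availability case, zero placeholders for missing modalities, the mean-pooling stand-in for the frozen transformer, and the squared-L2 form of the regularizer. Every function name is hypothetical, and the residual contextual prompts are not modeled.

```python
import random
from itertools import combinations

# Assumptions for illustration only: three modalities (a common multimodal FAS
# setup), tiny hypothetical sizes, and one prompt set per availability case.
MODALITIES = ("rgb", "depth", "ir")
DIM = 8         # hypothetical token embedding size
PROMPT_LEN = 4  # hypothetical number of prompt tokens per case


def availability_cases(modalities=MODALITIES):
    """Enumerate every non-empty subset of modalities that may be available."""
    cases = []
    for k in range(1, len(modalities) + 1):
        cases.extend(combinations(modalities, k))
    return cases


def init_prompt_pool(rng):
    """Create one learnable prompt set per availability case (the only trainable part)."""
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(PROMPT_LEN)]
        for case in availability_cases()
    }


def trainable_params(prompt_pool):
    """Count trainable scalars; the frozen encoder contributes none."""
    return sum(len(tok) for prompts in prompt_pool.values() for tok in prompts)


def build_sequence(tokens_by_modality, prompt_pool):
    """Prepend case-specific prompts and zero-fill the tokens of missing modalities.

    tokens_by_modality maps each *available* modality to an equal-length list of
    DIM-dimensional tokens (hypothetical patch embeddings).
    """
    if not tokens_by_modality:
        raise ValueError("at least one modality must be available")
    available = tuple(m for m in MODALITIES if m in tokens_by_modality)
    n_tokens = len(next(iter(tokens_by_modality.values())))
    seq = [list(p) for p in prompt_pool[available]]
    for m in MODALITIES:
        if m in tokens_by_modality:
            seq.extend(list(tok) for tok in tokens_by_modality[m])
        else:
            seq.extend([0.0] * DIM for _ in range(n_tokens))
    return seq


def frozen_encoder(seq):
    """Stand-in for the frozen pre-trained transformer: mean-pool over tokens."""
    return [sum(tok[d] for tok in seq) / len(seq) for d in range(DIM)]


def consistency_loss(full_emb, partial_emb):
    """Hypothetical missing-modality regularizer: squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(full_emb, partial_emb))


if __name__ == "__main__":
    rng = random.Random(0)
    pool = init_prompt_pool(rng)
    print(len(pool), trainable_params(pool))  # 7 224
    tokens = {m: [[rng.random() for _ in range(DIM)] for _ in range(3)] for m in MODALITIES}
    no_depth = {m: t for m, t in tokens.items() if m != "depth"}
    full = frozen_encoder(build_sequence(tokens, pool))
    partial = frozen_encoder(build_sequence(no_depth, pool))
    print(consistency_loss(full, partial))
```

In this sketch only the prompt pool would be updated during training while the encoder stays frozen, which is why the trainable-parameter count stays small relative to a full backbone. Penalizing the distance between the complete-input and partial-input embeddings is one simple way to push the model toward consistent features when a modality is dropped.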

updated: Wed Jul 26 2023 05:06:41 GMT+0000 (UTC)

published: Wed Jul 26 2023 05:06:41 GMT+0000 (UTC)

arXiv