M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

Junke Wang; Zuxuan Wu; Jingjing Chen; Yu-Gang Jiang

The widespread dissemination of forged images generated by Deepfake techniques has posed a serious threat to the trustworthiness of digital information. This demands effective approaches that can detect perceptually convincing Deepfakes generated by advanced manipulation techniques. Most existing approaches combat Deepfakes with deep neural networks by mapping the input image to a binary prediction without capturing the consistency among different pixels. In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection. We achieve this with transformer models, which have recently demonstrated superior performance in modeling dependencies between pixels for a variety of recognition tasks in computer vision. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which uses a multi-scale transformer that operates on patches of different sizes to detect local inconsistencies at different spatial levels. To improve the detection results and enhance the robustness of our method to image compression, M2TR also takes frequency information as input, which is combined with RGB features using a cross-modality fusion module. Developing and evaluating Deepfake detection methods requires large-scale datasets. However, we observe that samples in existing benchmarks contain severe artifacts and lack diversity. This motivates us to introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. On three Deepfake datasets, we conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods.
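
The idea of operating on patches of different sizes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the patch sizes (1, 2, 4), and the toy 4x4 single-channel grid are all assumptions chosen for illustration. It covers only the step of splitting one feature map into non-overlapping patches at several scales, and leaves out the attention computation and the frequency branch.

```python
def split_into_patches(grid, patch_size):
    """Split a 2D grid (a list of rows) into non-overlapping square patches.

    Hypothetical helper for illustration; patches are returned in row-major order.
    """
    height, width = len(grid), len(grid[0])
    if height % patch_size or width % patch_size:
        raise ValueError("grid dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Take patch_size rows, and patch_size columns from each of them.
            patch = [row[left:left + patch_size]
                     for row in grid[top:top + patch_size]]
            patches.append(patch)
    return patches


def multi_scale_patches(grid, patch_sizes=(1, 2, 4)):
    """Return a dict mapping each patch size to its list of patches.

    The default patch sizes are an assumption, not values from the paper.
    """
    return {size: split_into_patches(grid, size) for size in patch_sizes}


# Toy 4x4 single-channel "feature map" holding the values 0..15.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
scales = multi_scale_patches(feature_map)
```

In this toy setting, smaller patches give many fine-grained regions and larger patches give fewer, coarser ones, which is the sense in which inconsistencies can be examined at different spatial levels.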

updated: Wed Apr 21 2021 12:59:29 GMT+0000 (UTC)

published: Tue Apr 20 2021 05:43:44 GMT+0000 (UTC)

arXiv

