M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

Junke Wang; Zuxuan Wu; Wenhao Ouyang; Xintong Han; Jingjing Chen; Ser-Nam Lim; Yu-Gang Jiang

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain, which complement RGB information through a carefully designed cross-modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
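
To make the phrase "patches of different sizes" concrete, the following is a minimal sketch of splitting one image into non-overlapping patches at several scales. It is not the authors' implementation: the abstract gives no patch sizes, so the values 2, 4, and 8, the 8x8 toy image, and the helper names `split_into_patches` and `multi_scale_patches` are all assumptions made for illustration. The attention, frequency-domain, and fusion stages are not shown.

```python
from typing import Dict, List

# A single-channel image stored as rows of pixel values.
Image = List[List[float]]


def split_into_patches(image: Image, patch_size: int) -> List[Image]:
    """Split an H x W image into non-overlapping patch_size x patch_size patches.

    Assumption of this sketch: H and W are divisible by patch_size.
    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def multi_scale_patches(image: Image, patch_sizes: List[int]) -> Dict[int, List[Image]]:
    """Return the patches of the same image at each requested scale."""
    return {size: split_into_patches(image, size) for size in patch_sizes}


# Toy 8x8 image whose pixel value encodes its position (hypothetical data).
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

# Hypothetical patch sizes; the abstract does not specify them.
scales = multi_scale_patches(image, [2, 4, 8])
for size, patches in scales.items():
    print(f"patch size {size}: {len(patches)} patches")
```

Smaller patches isolate fine local detail, while larger ones cover broader regions, which is the intuition behind looking for inconsistencies at several spatial levels.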

updated: Tue Apr 19 2022 06:08:33 GMT+0000 (UTC)

published: Tue Apr 20 2021 05:43:44 GMT+0000 (UTC)

arXiv
