Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection

Sayantan Das; Mojtaba Kolahdouzi; Levent Özparlak; Will Hickie; Ali Etemad

We present a novel approach for detecting deepfake videos using a pair of vision transformers pre-trained in a self-supervised masked autoencoding setup. Our method consists of two distinct components: one learns spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields generated from consecutive frames. Unlike most approaches, where pre-training is performed on a large generic corpus of images, we show that strong results can be obtained by pre-training on smaller face-related datasets, namely Celeb-A (for the spatial learning component) and YouTube Faces (for the temporal learning component). We perform various experiments to evaluate our method on commonly used datasets, namely FaceForensics++ (Low Quality and High Quality, along with a new highly compressed version named Very Low Quality) and Celeb-DFv2. Our experiments show that our method sets a new state-of-the-art on FaceForensics++ (LQ, HQ, and VLQ) and obtains competitive results on Celeb-DFv2. Moreover, our method outperforms other methods in the area in a cross-dataset setup where we fine-tune our model on FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset generalization ability.
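
As a rough illustration of the masked autoencoding idea mentioned in the abstract, the sketch below cuts a frame into non-overlapping patches and randomly hides a fraction of them, leaving only the visible patches for an encoder to see. It is not the authors' implementation: the function names `patchify` and `random_patch_mask`, the patch size, and the 0.75 mask ratio are all assumptions chosen for illustration, and the abstract does not state the values actually used.

```python
import random


def patchify(frame, patch):
    """Cut an H x W frame (a list of rows) into non-overlapping
    patch x patch tiles, returned in row-major order."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in frame[top:top + patch]])
    return tiles


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists.

    mask_ratio=0.75 is a hypothetical default, not a value from the paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Toy 8 x 8 single-channel "frame" whose pixel values are their own positions.
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = patchify(frame, patch=2)
visible, masked = random_patch_mask(len(tiles), mask_ratio=0.75, seed=0)
visible_tiles = [tiles[i] for i in visible]  # what an encoder would receive
```

In a full pipeline, a decoder would then be trained to reconstruct the masked tiles from the encoded visible ones. Under the same assumptions, the masking step could be applied to optical flow fields instead of RGB frames for the temporal component.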

updated: Fri Feb 09 2024 12:25:03 GMT+0000 (UTC)

published: Mon Jun 12 2023 05:49:23 GMT+0000 (UTC)

arXiv
