Combining EfficientNet and Vision Transformers for Video Deepfake Detection

Davide Coccomini; Nicola Messina; Claudio Gennaro; Fabrizio Falchi

Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images and videos are now commonly produced using Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, since most generation methods have become extremely accurate at producing realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining results comparable to those of some very recent methods that use Vision Transformers. Unlike the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state of the art on the DeepFake Detection Challenge (DFDC).
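The abstract does not spell out the voting scheme, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes that each detected face has already been tracked across frames and scored by some classifier (higher score meaning more likely fake), that each frame casts one vote per face, and that a video is flagged as fake if any face in it is voted fake. The function names, the 0.5 threshold, and the "any face" rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def face_is_fake(frame_scores: List[float], threshold: float = 0.5) -> bool:
    """Majority vote over the per-frame scores of a single face.

    Each frame votes "fake" when its score exceeds the (assumed) threshold;
    the face is labelled fake only when strictly more than half of the
    frames vote that way.
    """
    fake_votes = sum(1 for score in frame_scores if score > threshold)
    return fake_votes * 2 > len(frame_scores)


def video_is_fake(scores_by_face: Dict[str, List[float]],
                  threshold: float = 0.5) -> bool:
    """Label a video fake if any tracked face is voted fake (assumed rule).

    `scores_by_face` maps a hypothetical face/identity id to the list of
    per-frame fake scores produced for that face. Faces with no scores
    are skipped.
    """
    return any(
        face_is_fake(scores, threshold)
        for scores in scores_by_face.values()
        if scores
    )


if __name__ == "__main__":
    # Two faces in the same shot: only the second one looks manipulated.
    example = {
        "face_a": [0.1, 0.2, 0.3],
        "face_b": [0.9, 0.8, 0.2],
    }
    print(video_is_fake(example))
```

Treating the video as fake when a single face is voted fake reflects the fact that a manipulation may target only one person in a shot; a different aggregation, such as averaging scores per face before thresholding, would be an equally reasonable assumption.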

updated: Thu Jan 20 2022 14:35:11 GMT+0000 (UTC)

published: Tue Jul 06 2021 13:35:11 GMT+0000 (UTC)

arXiv
