MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

Davide Alessandro Coccomini; Giorgos Kordopatis-Zilos; Giuseppe Amato; Roberto Caldelli; Fabrizio Falchi; Symeon Papadopoulos; Claudio Gennaro

In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles both multiple people appearing in the same video and variations in face size. Previous approaches disregard such information, either by using simple a-posteriori aggregation schemes (i.e., an average or max operation) or by using only one identity (i.e., the largest one) for inference. In contrast, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of the multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that attends to each face sequence independently based on a masking operation and facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding, which encodes each face sequence's temporal information, and (ii) the Size Embedding, which encodes the size of each face as a ratio to the video frame size. These extensions allow our system to adapt particularly well in the wild by learning how to aggregate information from multiple identities, which is usually disregarded by other methods in the literature. It achieves state-of-the-art results on the ForgeryNet dataset, with an improvement of up to 14% AUC on videos containing multiple people, and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection
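
The abstract names two ideas that a small example can make concrete: attention restricted by an identity mask, and face size expressed as a ratio to the frame size. The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the use of an area ratio, and the convention that a classification token (marked with the hypothetical id `-1`) may attend to every token for video-level aggregation are all assumptions made for illustration.

```python
import math


def face_size_ratio(face_w, face_h, frame_w, frame_h):
    """Face size relative to the frame. Using the area ratio is an assumption."""
    return (face_w * face_h) / (frame_w * frame_h)


def identity_mask(identity_ids, cls_id=-1):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed rule: a face token attends only to tokens of the same identity,
    while the classification token (cls_id, hypothetical) attends to all tokens.
    """
    mask = []
    for query_id in identity_ids:
        row = [(query_id == cls_id) or (query_id == key_id) for key_id in identity_ids]
        mask.append(row)
    return mask


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    top = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# One classification token followed by two tokens of identity 0 and one of identity 1.
ids = [-1, 0, 0, 1]
mask = identity_mask(ids)

# Attention weights for the first identity-0 token: only identity-0 keys contribute.
weights = masked_softmax([0.0, 1.0, 2.0, 3.0], mask[1])

# A 192x108 face in a 1920x1080 frame covers about 1% of the frame area.
ratio = face_size_ratio(192, 108, 1920, 1080)
```

In this sketch the mask keeps the face sequences of different identities from mixing inside attention, while the classification token still sees all of them. By contrast, the a-posteriori schemes mentioned in the abstract score each identity separately and only then combine the scores with an average or a max.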

updated: Sun Nov 20 2022 15:17:24 GMT+0000 (UTC)

published: Sun Nov 20 2022 15:17:24 GMT+0000 (UTC)

arXiv
