Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

Komal Chugh; Parul Gupta; Abhinav Dhall; Ramanathan Subramanian

We propose detecting deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
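
The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not fix the details, so several choices here are assumptions made only for illustration: Euclidean distance as the per-chunk dissimilarity, the mean as the aggregate, a standard margin-based form of contrastive loss, and a hypothetical `threshold` for flagging chunks. The function names are likewise invented for this example and are not taken from the authors' code; the per-chunk embeddings are assumed to have already been produced by some audio and visual feature extractors.

```python
import math
from typing import List, Sequence


def chunk_dissimilarity(audio_feat: Sequence[float], visual_feat: Sequence[float]) -> float:
    """Euclidean distance between one audio and one visual chunk embedding (assumed metric)."""
    if len(audio_feat) != len(visual_feat):
        raise ValueError("audio and visual embeddings must have the same dimension")
    return math.sqrt(sum((a - v) ** 2 for a, v in zip(audio_feat, visual_feat)))


def per_chunk_scores(audio_chunks: Sequence[Sequence[float]],
                     visual_chunks: Sequence[Sequence[float]]) -> List[float]:
    """Dissimilarity score for each temporally aligned audio/visual chunk pair."""
    if len(audio_chunks) != len(visual_chunks) or not audio_chunks:
        raise ValueError("need the same non-zero number of audio and visual chunks")
    return [chunk_dissimilarity(a, v) for a, v in zip(audio_chunks, visual_chunks)]


def modality_dissonance_score(audio_chunks: Sequence[Sequence[float]],
                              visual_chunks: Sequence[Sequence[float]]) -> float:
    """Aggregate the per-chunk scores into one video-level score (mean is an assumption)."""
    scores = per_chunk_scores(audio_chunks, visual_chunks)
    return sum(scores) / len(scores)


def contrastive_loss(distance: float, is_real: bool, margin: float = 1.0) -> float:
    """Margin-based contrastive loss (assumed form): real pairs are pulled together,
    fake pairs are pushed apart until their distance reaches the margin."""
    if is_real:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2


def localize_forged_chunks(audio_chunks: Sequence[Sequence[float]],
                           visual_chunks: Sequence[Sequence[float]],
                           threshold: float) -> List[int]:
    """Indices of chunks whose dissimilarity exceeds a hypothetical threshold."""
    scores = per_chunk_scores(audio_chunks, visual_chunks)
    return [i for i, s in enumerate(scores) if s > threshold]
```

Under these assumptions, a video would be flagged as fake when its video-level score exceeds a threshold chosen on held-out data, and the per-chunk scores give a rough temporal localization of the manipulated segments.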

updated: Sat Mar 20 2021 15:09:49 GMT+0000 (UTC)

published: Fri May 29 2020 06:09:33 GMT+0000 (UTC)

arXiv
