Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

Zhixi Cai; Kalin Stefanov; Abhinav Dhall; Munawar Hayat

Due to its high societal impact, deepfake detection is receiving active attention in the computer vision community. Most deepfake detection methods rely on spatio-temporal modifications based on identity, facial attributes, and adversarial perturbations, applied to the whole video or at random locations, while keeping the meaning of the content intact. However, a sophisticated deepfake may contain only a small segment of video/audio manipulation, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. To address this gap, we introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed at strategic locations in order to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching, and frame classification loss functions. Our extensive quantitative analysis demonstrates the strong performance of the proposed method on both tasks: temporal forgery localization and deepfake detection.
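
To make the task of temporal forgery localization concrete, the following is a minimal sketch of what a localization output could look like: per-frame fake/real decisions grouped into time segments, and a temporal intersection-over-union (IoU) score between a predicted and a reference segment. It is not the BA-TFD method or the LAV-DF evaluation protocol; the function names, the frame rate, and the use of temporal IoU here are illustrative assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def frames_to_segments(fake_flags: List[bool], fps: float) -> List[Segment]:
    """Group runs of consecutive fake frames into time segments.

    Hypothetical helper: `fake_flags[i]` is True when frame i is judged fake.
    """
    segments: List[Segment] = []
    start = None  # index of the first frame of the current fake run
    for i, flag in enumerate(fake_flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:  # a fake run that lasts until the final frame
        segments.append((start / fps, len(fake_flags) / fps))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Overlap duration divided by the combined duration of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # 20 frames at an assumed 5 fps; frames 5..9 are flagged as fake.
    flags = [False] * 5 + [True] * 5 + [False] * 10
    predicted = frames_to_segments(flags, fps=5.0)
    print(predicted)
    # Compare against a hypothetical reference segment.
    print(temporal_iou(predicted[0], (1.5, 2.5)))
```

A short manipulated segment, such as a single replaced word, shows up here as one brief interval, which is why a localization task is scored on segment overlap rather than on a single video-level real/fake label.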

updated: Wed Apr 13 2022 08:02:11 GMT+0000 (UTC)

published: Wed Apr 13 2022 08:02:11 GMT+0000 (UTC)

arXiv
