Audio-Visual Person-of-Interest DeepFake Detection

Davide Cozzolino; Alessandro Pianese; Matthias Nießner; Luisa Verdoliva

Face manipulation technology is advancing rapidly, and new methods are proposed continually. This work proposes a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. Our key insight is that each person has specific characteristics that a synthetic generator likely cannot reproduce. Accordingly, we extract audio-visual features that characterize a person's identity and use them to build a person-of-interest (POI) deepfake detector. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. As a result, when the video and/or audio of a person is manipulated, its representation in the embedding space becomes inconsistent with the real identity, allowing reliable detection. Training is carried out exclusively on real talking-face videos; the detector therefore does not depend on any specific manipulation method, which favors generalization to unseen manipulations. In addition, our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos. Experiments on a wide variety of datasets confirm that our method achieves state-of-the-art performance, especially on low-quality videos. Code is publicly available at https://github.com/grip-unina/poi-forensics.
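The identity-consistency test described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of the minimum squared distance as the score, and the random vectors standing in for learned segment embeddings are all assumptions made for illustration. The idea shown is only that a test video's segment embeddings are compared against embeddings from real reference videos of the same person, and a large distance suggests the content is inconsistent with that identity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each embedding (row) to unit length."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def poi_inconsistency(test_segments, reference_segments):
    """Hypothetical score: smallest squared distance between any test-segment
    embedding and any reference embedding of the person of interest.
    Larger values mean the test video looks less like the claimed identity."""
    t = l2_normalize(test_segments)        # shape (n_test, dim)
    r = l2_normalize(reference_segments)   # shape (n_ref, dim)
    d2 = ((t[:, None, :] - r[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min())

# Stand-in embeddings (assumption): in practice these would come from a
# network trained contrastively on real talking-face videos.
rng = np.random.default_rng(0)
identity = rng.normal(size=16)   # "true" identity direction
other = rng.normal(size=16)      # a different identity

reference = identity + 0.05 * rng.normal(size=(20, 16))  # real videos of the POI
genuine = identity + 0.05 * rng.normal(size=(5, 16))     # pristine test video
suspect = other + 0.05 * rng.normal(size=(5, 16))        # manipulated test video

score_genuine = poi_inconsistency(genuine, reference)
score_suspect = poi_inconsistency(suspect, reference)
print(score_genuine, score_suspect)
```

The same scoring function could be applied separately to audio and video embeddings; how the per-modality scores are combined and thresholded is left open here, since the abstract does not specify it.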

updated: Thu May 18 2023 06:56:42 GMT+0000 (UTC)

published: Wed Apr 06 2022 20:51:40 GMT+0000 (UTC)

arXiv
