Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors

Hasam Khalid; Minha Kim; Shahroz Tariq; Simon S. Woo

Significant advances in deepfake generation have raised security and privacy concerns. Attackers can easily impersonate someone in an image by replacing the original face with the target person's face. Moreover, voice cloning with deep-learning techniques is an emerging domain: an attacker can now generate a realistic cloned voice using only a few seconds of the target person's audio. In response to the potential harm deepfakes can cause, researchers have proposed deepfake detection methods. However, these methods focus on a single modality, i.e., either video or audio. A detector that can keep pace with recent advances in deepfake generation must instead detect deepfakes in multiple modalities, i.e., both video and audio. Building such a detector requires a dataset that contains deepfake videos together with their corresponding deepfake audio. We found a recent deepfake dataset, the Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), that contains not only deepfake videos but also synthesized fake audio. We used this multimodal deepfake dataset to perform detailed baseline experiments with state-of-the-art unimodal, ensemble-based, and multimodal detection methods. Our experiments show that unimodal detectors, which address only a single modality (video or audio), do not perform as well as ensemble-based methods, whereas purely multimodal baselines perform worst.

updated: Tue Sep 07 2021 11:00:20 GMT+0000 (UTC)

published: Tue Sep 07 2021 11:00:20 GMT+0000 (UTC)

arXiv
