An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection

Ganglai Wang; Peng Zhang; Lei Xie; Wei Huang; Yufei Zha; Yanning Zhang

DeepFake-based digital facial forgery threatens public media security. When lip manipulation is used in talking face generation, fake video detection becomes even more difficult. Because only the lip shape is changed to match the given speech, identity-related facial features are hard to discriminate in such fake talking face videos. Combined with a lack of attention to the audio stream as prior knowledge, detection failure on generated fake talking faces becomes inevitable. Inspired by the decision-making mechanism of the human multisensory perception system, in which auditory information enhances post-sensory visual evidence for informed decisions, this study proposes FTFDNet, a fake talking face detection framework that incorporates audio and visual representations to detect fake talking face videos more accurately. Furthermore, an audio-visual attention mechanism (AVAM) is proposed to discover more informative features; it is modular and can be seamlessly integrated into any audio-visual CNN architecture. With the additional AVAM, the proposed FTFDNet achieves better detection performance on the established dataset (FTFDD). Evaluation shows excellent performance on fake talking face video detection, with a detection rate above 97%.

updated: Thu Mar 10 2022 06:16:11 GMT+0000 (UTC)

published: Thu Mar 10 2022 06:16:11 GMT+0000 (UTC)

arXiv
