Landmark Enhanced Multimodal Graph Learning for Deepfake Video Detection

Zhiyuan Yan; Peng Sun; Yubo Lang; Shuo Du; Shanzhuo Zhang; Wei Wang

With the rapid development of face forgery technology, deepfake videos have attracted widespread attention in digital media. Perpetrators make heavy use of these videos to spread disinformation and make misleading statements. Most existing deepfake detection methods focus on texture features, which are easily affected by external fluctuations such as illumination and noise. Detection methods based on facial landmarks are more robust to such external variables but lack sufficient detail. How to effectively mine distinctive features in the spatial, temporal, and frequency domains and fuse them with facial landmarks for forged video detection is therefore still an open question. To this end, we propose a Landmark Enhanced Multimodal Graph Neural Network (LEM-GNN) that builds on information from multiple modalities and the geometric features of facial landmarks. Specifically, at the frame level, we design a fusion mechanism that mines a joint representation of the spatial and frequency domain elements while introducing geometric facial features to enhance the robustness of the model. At the video level, we first regard each frame in a video as a node in a graph and encode temporal information into the edges of the graph. Then, by applying the message passing mechanism of the graph neural network (GNN), the multimodal features are effectively combined to obtain a comprehensive representation of the video forgery. Extensive experiments show that our method consistently outperforms state-of-the-art (SOTA) methods on widely used benchmarks.
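The video-level idea in the abstract (frames as graph nodes, temporal information on the edges, message passing to combine features) can be illustrated with a small sketch. This is not the LEM-GNN implementation: the abstract does not specify the edge rule, the fusion mechanism, or the GNN layers. The sketch assumes, purely for illustration, that per-frame spatial, frequency, and landmark features are fused by concatenation, that frames at most `window` steps apart are connected with weight 1 / (1 + temporal distance), and that one round of edge-weighted mean aggregation followed by mean pooling yields the video representation. All function names and feature sizes are hypothetical.

```python
import random

def temporal_adjacency(num_frames, window=2):
    """Assumed edge rule: connect frames at most `window` steps apart,
    weighting each edge by 1 / (1 + temporal distance). Self-loops get weight 1."""
    adj = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            dist = abs(i - j)
            row.append(1.0 / (1.0 + dist) if dist <= window else 0.0)
        adj.append(row)
    return adj

def message_passing(node_feats, adj):
    """One round of message passing: each node is replaced by the
    edge-weighted mean of its neighbors' features (itself included)."""
    dim = len(node_feats[0])
    updated = []
    for row in adj:
        total = sum(row)
        updated.append([
            sum(w * feats[k] for w, feats in zip(row, node_feats)) / total
            for k in range(dim)
        ])
    return updated

def video_representation(frame_feats, window=2):
    """Build the frame graph, run one message-passing round, then mean-pool
    the node features into a single video-level vector."""
    adj = temporal_adjacency(len(frame_feats), window)
    hidden = message_passing(frame_feats, adj)
    return [sum(col) / len(hidden) for col in zip(*hidden)]

# Toy input: 8 frames with random stand-ins for the three feature types.
random.seed(0)
frame_feats = []
for _ in range(8):
    spatial = [random.gauss(0, 1) for _ in range(4)]     # hypothetical size
    frequency = [random.gauss(0, 1) for _ in range(4)]   # hypothetical size
    landmarks = [random.gauss(0, 1) for _ in range(2)]   # hypothetical size
    frame_feats.append(spatial + frequency + landmarks)  # fusion by concatenation (assumption)

video_vec = video_representation(frame_feats)
print(len(video_vec))  # 10
```

In a trained model the aggregation step would include learned weights and a nonlinearity, and the pooled vector would feed a real/fake classifier; those parts are omitted here because the abstract gives no details about them.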

updated: Mon Sep 12 2022 17:17:49 GMT+0000 (UTC)

published: Mon Sep 12 2022 17:17:49 GMT+0000 (UTC)

arXiv
