Multi-Modal Semantic Inconsistency Detection in Social Media News Posts

Scott McCrae; Kehan Wang; Avideh Zakhor

As computer-generated content and deepfakes steadily improve, semantic approaches to multimedia forensics will become more important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. We develop a multi-modal fusion framework that identifies mismatches between videos and captions by leveraging an ensemble method based on textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.
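The abstract mentions two ideas that a short sketch may make more concrete: evaluating on random caption/video mismatches, and fusing features from several modalities before classification. The Python below is only an illustration under stated assumptions, not the authors' implementation. It assumes that a mismatch is built by pairing a video with a caption drawn from a different post, and that fusion is a plain concatenation of per-modality feature vectors. The modality names, `make_random_mismatches`, and `fuse` are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical modality names, loosely following the list in the abstract.
MODALITIES = (
    "caption_text",
    "audio_transcript",
    "video_semantics",
    "objects",
    "named_entities",
    "faces",
)


def make_random_mismatches(
    posts: Sequence[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Build (video_id, caption, label) samples from (video_id, caption) posts.

    Assumption: each post yields one pristine sample (label 0) and one
    mismatched sample (label 1) whose caption comes from a different post.
    """
    if len(posts) < 2:
        raise ValueError("need at least two posts to swap captions")
    rng = random.Random(seed)
    samples: List[Tuple[str, str, int]] = []
    for i, (video_id, caption) in enumerate(posts):
        samples.append((video_id, caption, 0))
        # Draw an index uniformly from all posts except post i.
        j = rng.randrange(len(posts) - 1)
        if j >= i:
            j += 1
        samples.append((video_id, posts[j][1], 1))
    return samples


def fuse(features: Dict[str, List[float]]) -> List[float]:
    """Concatenate per-modality feature vectors in a fixed order.

    Assumption: simple concatenation stands in for whatever fusion the
    paper's architecture actually uses.
    """
    fused: List[float] = []
    for name in MODALITIES:
        fused.extend(features[name])
    return fused
```

A binary classifier trained on the fused vectors would then predict the label. Note that if two posts share an identical caption, a "mismatched" sample built this way could still be consistent, so a real pipeline would likely need a check for that case.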

updated: Wed May 26 2021 21:25:27 GMT+0000 (UTC)

published: Wed May 26 2021 21:25:27 GMT+0000 (UTC)

arXiv
