Multimodal Deception Detection in Videos via Analyzing Emotional State-based Feature

Jun-Teng Yang; Guei-Ming Liu; Scott C.-H. Huang

Deception detection is an important task that has become a hot research topic because of its potential applications. It can be applied in many areas, from national security (e.g., airport security, jurisprudence, and law enforcement) to real-life applications (e.g., business and computer vision). However, some critical problems remain and are worth further investigation. One of the major challenges is data scarcity. Until now, only one multimodal benchmark dataset on deception detection has been published, containing 121 video clips (61 in the deceptive class and 60 in the truthful class). This amount of data is hardly enough to train deep neural network-based methods, which therefore often suffer from overfitting and poor generalization. In addition, the ground truth data contains frames that are unusable for several reasons, including faces too small for the facial expression to be recognized, faces covered by text, and file corruption. Most of the literature does not consider these problems. In this paper, we first design a series of data preprocessing methods to deal with these problems. We then propose a multimodal deception detection framework that constructs our novel emotional state-based feature and uses the open toolkit openSMILE to extract features from the audio modality. A voting scheme is also designed to combine the emotional state information obtained from the visual and audio modalities. Finally, the novel emotion state transformation (EST) feature is determined by our algorithm. A critical analysis and comparison of the proposed method with the state-of-the-art multimodal method shows that overall accuracy improves from 84.16% to 91.67% and ROC-AUC from 0.9211 to 0.9244.
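The abstract does not spell out the voting scheme or the EST algorithm, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes that each video segment yields several frame-level emotion labels from the visual modality and one label from the audio modality, that the two are combined by a simple majority vote, and that the EST feature can be approximated by the normalized counts of transitions between consecutive segment-level emotional states. The label set `EMOTIONS` and the functions `vote` and `transition_feature` are hypothetical names introduced for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical emotion label set; the paper's actual categories may differ.
EMOTIONS = ("angry", "happy", "neutral", "sad")


def vote(visual_labels, audio_label):
    """Majority vote over frame-level visual labels plus one audio label.

    Assumption: every label counts as one vote. Ties go to the label seen
    first, because Counter.most_common keeps insertion order among equals.
    """
    counts = Counter(visual_labels)
    counts[audio_label] += 1
    return counts.most_common(1)[0][0]


def transition_feature(states, emotions=EMOTIONS):
    """Normalized counts of transitions between consecutive emotional states.

    Returns a flat vector of length len(emotions) ** 2, ordered by
    (from_state, to_state) pairs. This is an assumed stand-in for the
    EST feature, not the algorithm from the paper.
    """
    pairs = Counter(zip(states, states[1:]))
    total = max(len(states) - 1, 1)  # avoid division by zero for short clips
    return [pairs[(a, b)] / total for a, b in product(emotions, repeat=2)]


if __name__ == "__main__":
    # Toy clip with three segments: (visual frame labels, audio label).
    segments = [
        (["happy", "happy", "neutral"], "happy"),
        (["sad", "neutral"], "sad"),
        (["sad"], "sad"),
    ]
    states = [vote(visual, audio) for visual, audio in segments]
    print(states)
    print(transition_feature(states))
```

The resulting fixed-length vector could then be passed to any standard classifier. A fixed label set keeps the feature dimension constant across clips of different lengths, which matters when only 121 clips are available.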

updated: Fri Apr 16 2021 21:20:32 GMT+0000 (UTC)

published: Fri Apr 16 2021 21:20:32 GMT+0000 (UTC)

arXiv
