Non-Volume Preserving-based Fusion to Group-Level Emotion Recognition on Crowd Videos

Kha Gia Quach; Ngan Le; Chi Nhan Duong; Ibsa Jalata; Kaushik Roy; Khoa Luu

Group-level emotion recognition (ER) is a growing research area, driven by increasing demand in both security and social media for assessing crowds of all sizes. This work extends earlier ER investigations, which focused on group-level ER either on single images or within a video, by fully investigating group-level expression recognition on crowd videos. In this paper, we propose an effective deep feature-level fusion mechanism to model the spatial-temporal information in crowd videos. In our approach, the fusion is performed in the deep feature domain by a generative probabilistic model, Non-Volume Preserving Fusion (NVPF), which models the relationships among spatial information. Furthermore, we extend the proposed spatial NVPF approach to a spatial-temporal NVPF approach (TNVPF) that learns the temporal information between frames. To demonstrate the robustness and effectiveness of each component of the proposed approach, three experiments were conducted: (i) evaluation on the AffectNet database to benchmark the proposed EmoNet for facial expression recognition; (ii) evaluation on EmotiW2018 to benchmark the proposed deep feature-level fusion mechanism, NVPF; and (iii) evaluation of the proposed TNVPF on a novel Group-level Emotion on Crowd Videos (GECV) dataset composed of 627 videos collected from publicly available sources. The GECV dataset is a collection of videos containing crowds of people. Each video is labeled with emotion categories at three levels: individual faces, groups of people, and the entire video frame.
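
For readers unfamiliar with the term, "non-volume preserving" usually refers to invertible transformations whose Jacobian determinant is not fixed to one, as in affine coupling layers. The sketch below illustrates that general idea only; it is not the authors' NVPF or TNVPF, and the function names, the linear scale and shift maps, and the feature sizes are all assumptions made for illustration.

```python
import numpy as np

def coupling_forward(x, w_s, w_t):
    """Affine coupling: pass the first half through, scale-and-shift the second half."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s = np.tanh(x1 @ w_s)       # log-scale, bounded for numerical stability (assumed form)
    t = x1 @ w_t                # shift (assumed linear map)
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)    # log|det J|; generally nonzero, so volume is not preserved
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, w_s, w_t):
    """Exact inverse of coupling_forward."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s = np.tanh(y1 @ w_s)
    t = y1 @ w_t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

# Hypothetical input: 3 samples, each an 8-dim vector (e.g. two 4-dim features concatenated).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
w_s = 0.1 * rng.normal(size=(4, 4))
w_t = 0.1 * rng.normal(size=(4, 4))

y, log_det = coupling_forward(x, w_s, w_t)
x_rec = coupling_inverse(y, w_s, w_t)
```

Because the transform is invertible and its log-determinant is cheap to compute, a model built from such layers can be trained by maximizing an exact likelihood, which is what makes this family attractive as a generative probabilistic model.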

updated: Wed Mar 23 2022 05:41:56 GMT+0000 (UTC)

published: Wed Nov 28 2018 21:35:23 GMT+0000 (UTC)

arXiv
