Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition

Hengshun Zhou; Debin Meng; Yuanyuan Zhang; Xiaojiang Peng; Jun Du; Kai Wang; Yu Qiao

Audio-video based emotion recognition aims to classify a given video into one of the basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features with both speech-spectrograms and Log Mel-spectrograms, and evaluate several facial features with different CNN models and different emotion pretraining strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and exploring feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, ranking third in the challenge.
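As a rough illustration of the cross-modal fusion named above, the following is a minimal sketch of factorized bilinear pooling over one audio and one video feature vector. It is not the authors' implementation: the feature sizes, the random projection matrices `U` and `V`, the factor count `k`, and the power and L2 normalization steps (commonly paired with FBP in the literature) are all assumptions made for the example.

```python
import math
import random


def project(x, W):
    """Compute W^T x, where W is a list of len(x) rows."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]


def factorized_bilinear_pooling(audio, video, U, V, k):
    """Fuse two feature vectors into one vector of length o.

    U has shape (len(audio), o * k) and V has shape (len(video), o * k).
    """
    # Project both modalities into a shared (o * k)-dimensional space
    # and combine them with an element-wise product.
    joint = [a * b for a, b in zip(project(audio, U), project(video, V))]
    # Sum-pool each consecutive group of k factors into one output value.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Assumed post-processing: signed square root, then L2 normalization.
    pooled = [math.copysign(math.sqrt(abs(p)), p) for p in pooled]
    norm = math.sqrt(sum(p * p for p in pooled))
    return [p / norm for p in pooled] if norm > 0 else pooled


# Hypothetical sizes and random weights, for illustration only.
random.seed(0)
audio_dim, video_dim, out_dim, k = 8, 12, 4, 3
audio = [random.gauss(0, 1) for _ in range(audio_dim)]
video = [random.gauss(0, 1) for _ in range(video_dim)]
U = [[random.gauss(0, 1) for _ in range(out_dim * k)] for _ in range(audio_dim)]
V = [[random.gauss(0, 1) for _ in range(out_dim * k)] for _ in range(video_dim)]

fused = factorized_bilinear_pooling(audio, video, U, V, k)
print(len(fused))
```

In a trained model the projection matrices would be learned parameters; plain feature concatenation, the simpler alternative mentioned in the abstract, would instead join the two vectors end to end.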

updated: Sun Dec 27 2020 10:50:24 GMT+0000 (UTC)

published: Sun Dec 27 2020 10:50:24 GMT+0000 (UTC)

arXiv

