Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks

Thomas Teixeira; Eric Granger; Alessandro Lameiras Koerich

Facial expressions are one of the most powerful ways of depicting specific patterns in human behavior and describing human emotional state. Despite the impressive advances in affective computing over the last decade, automatic video-based systems for facial expression recognition still cannot properly handle variations in facial expression among individuals, nor cross-cultural and demographic aspects. Indeed, recognizing facial expressions is a difficult task even for humans. In this paper, we investigate the suitability of state-of-the-art deep learning architectures based on convolutional neural networks (CNNs) for continuous emotion recognition using long video sequences captured in-the-wild. This study focuses on deep learning models that can encode spatiotemporal relations in videos, considering a complex and multi-dimensional emotion space in which values of valence and arousal must be predicted. We have developed and evaluated convolutional recurrent neural networks, which combine 2D-CNNs and long short-term memory units, and inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning, using application-specific videos. Experimental results on the challenging SEWA-DB dataset show that these architectures can effectively be fine-tuned to encode the spatiotemporal information from successive raw pixel images and achieve state-of-the-art results on this dataset.
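
The abstract does not spell out how the 2D weights are inflated. As a minimal sketch, assuming the commonly used scheme in which a 2D kernel is repeated along the temporal axis and rescaled by the temporal depth, the idea can be illustrated as below. The function names, the example kernel, and the temporal depth of 4 are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Sketch of 2D-to-3D kernel inflation using plain Python lists.
# Assumption: the 2D kernel is copied time_depth times along the temporal
# axis and divided by time_depth. The paper may use a different variant.

def inflate_kernel(kernel_2d, time_depth):
    """Return a 3D kernel of shape (time_depth, H, W) built from a 2D kernel."""
    if time_depth < 1:
        raise ValueError("time_depth must be >= 1")
    return [[[w / time_depth for w in row] for row in kernel_2d]
            for _ in range(time_depth)]


def response_2d(patch, kernel_2d):
    """Dot product of one image patch with a 2D kernel of the same size."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel_2d)
               for p, w in zip(patch_row, kernel_row))


def response_3d(clip, kernel_3d):
    """Dot product of a clip (list of patches) with a 3D kernel."""
    return sum(response_2d(frame, k_slice)
               for frame, k_slice in zip(clip, kernel_3d))


# Hypothetical 3x3 kernel and image patch.
kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]
patch = [[0.1, 0.5, 0.9],
         [0.2, 0.6, 1.0],
         [0.3, 0.7, 1.1]]

time_depth = 4
kernel_3d = inflate_kernel(kernel, time_depth)

# A "static video": the same patch repeated over time.
static_clip = [patch] * time_depth
```

Under this assumption, the inflated kernel applied to a clip made of identical frames gives the same response as the original 2D kernel on a single frame, so the pre-trained spatial features are preserved as the starting point for fine-tuning on video.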

updated: Fri Jan 15 2021 14:49:00 GMT+0000 (UTC)

published: Wed Nov 18 2020 13:42:05 GMT+0000 (UTC)

arXiv
