Dynamic Emotion Modeling with Learnable Graphs and Graph Inception Network

A. Shirian; S. Tripathi; T. Guha

Human emotion is expressed, perceived and captured using a variety of dynamic data modalities, such as speech (verbal), videos (facial expressions) and motion sensors (body gestures). We propose a generalized approach to emotion recognition that can adapt across modalities by modeling dynamic data as structured graphs. The motivation behind the graph approach is to build compact models without compromising on performance. To alleviate the problem of optimal graph construction, we cast this as a joint graph learning and classification task. To this end, we present the Learnable Graph Inception Network (L-GrIN), which jointly learns to recognize emotion and to identify the underlying graph structure in the dynamic data. Our architecture comprises multiple novel components: a new graph convolution operation, a graph inception layer, learnable adjacency, and a learnable pooling function that yields a graph-level embedding. We evaluate the proposed architecture on five benchmark emotion recognition databases spanning three different modalities (video, audio, motion capture), where each database captures one of the following emotional cues: facial expressions, speech or body gestures. We achieve state-of-the-art performance on all five databases, outperforming several competitive baselines and relevant existing methods. Our graph architecture shows superior performance with significantly fewer parameters than convolutional or recurrent neural networks, which suggests it is applicable to resource-constrained devices.
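To make the idea of joint graph learning and classification more concrete, here is a minimal NumPy sketch of a forward pass that combines a learnable adjacency, one graph convolution, and a learnable pooling step. It is not the L-GrIN architecture: the abstract does not specify the paper's graph convolution or inception layer, so this sketch substitutes a standard symmetric-normalized (GCN-style) convolution and omits the inception layer entirely. All names (`forward`, `A_param`, `p`, `W_out`), the chain-graph initialization, and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def forward(X, A_param, W, p, W_out):
    """Forward pass of a toy graph classifier (illustrative, not L-GrIN itself).

    X       : (N, F) node features, e.g. one node per frame or time step
    A_param : (N, N) free parameters defining the learnable adjacency
    W       : (F, H) graph-convolution weights
    p       : (N,)   learnable pooling weights over nodes
    W_out   : (H, C) classifier weights
    """
    n = X.shape[0]
    # Assumption: symmetrize and clip so the result is a valid weighted adjacency.
    A = np.maximum(A_param + A_param.T, 0.0) / 2.0
    # GCN-style normalization with self-loops (a stand-in for the paper's operator).
    A_hat = A + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One graph convolution followed by ReLU.
    H = np.maximum(A_norm @ X @ W, 0.0)
    # Learnable pooling: a weighted sum of node embeddings gives a graph-level embedding.
    g = p @ H
    # Softmax over emotion classes.
    logits = g @ W_out
    z = np.exp(logits - logits.max())
    return z / z.sum(), A

# Toy usage: 6 frames, 4 features per frame, 8 hidden units, 3 emotion classes.
rng = np.random.default_rng(0)
N, F, Hdim, C = 6, 4, 8, 3
X = rng.normal(size=(N, F))
# Assumed initialization: a chain graph linking each frame to its neighbours.
A_param = np.eye(N, k=1) + np.eye(N, k=-1)
W = rng.normal(size=(F, Hdim)) * 0.1
p = np.full(N, 1.0 / N)          # starts as mean pooling
W_out = rng.normal(size=(Hdim, C)) * 0.1
probs, A = forward(X, A_param, W, p, W_out)
```

In a trainable version, `A_param`, `W`, `p` and `W_out` would all be updated by gradient descent on the classification loss, which is one way to read "joint graph learning and classification". Because the graph is small and the weights are shared across nodes, the parameter count stays low compared with a convolutional or recurrent model over the same sequence.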

updated: Mon Feb 08 2021 12:21:00 GMT+0000 (UTC)

published: Thu Aug 06 2020 13:51:31 GMT+0000 (UTC)

arXiv
