Whose Emotion Matters? Speaker Detection without Prior Knowledge

Hugo Carneiro; Cornelius Weber; Stefan Wermter

The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as offered, for example, in the video-based MELD dataset. However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this. First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires detecting the person who speaks each utterance. In this paper, we demonstrate that, by using recent automatic speech recognition and active speaker detection models, we are able to realign the videos of MELD and capture the facial expressions of the uttering speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD videos match the corresponding utterances offered in the dataset more closely than the original videos do. Finally, we devise a model for emotion recognition in conversations trained on the face and audio information of the realigned MELD videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that active speaker detection is indeed effective for extracting facial expressions from the uttering speakers, and that faces provide more informative visual cues than the visual features state-of-the-art models have been using so far.
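The realignment idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' method, which the abstract does not spell out. It assumes that an automatic speech recognition system has already produced word-level timestamps for a clip, and it simply slides a window over those words to find the span that best matches the utterance text. The function names `realign` and `_clean`, the `(word, start, end)` input format, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from difflib import SequenceMatcher

_PUNCT = ".,!?;:\"'"

def _clean(word):
    # Lowercase and strip surrounding punctuation so that the transcript
    # and the ASR output are compared on equal terms.
    return word.strip(_PUNCT).lower()

def realign(utterance, asr_words):
    """Find the time span whose recognized words best match `utterance`.

    asr_words: list of (word, start_sec, end_sec) tuples in time order
               (hypothetical format; real ASR tools differ).
    Returns (start_sec, end_sec, score), or None if either input is empty.
    """
    target = [_clean(w) for w in utterance.split() if _clean(w)]
    words = [_clean(w) for w, _, _ in asr_words]
    n = len(target)
    if n == 0 or not words:
        return None

    best = None
    # Slide a window as long as the utterance over the recognized words.
    for i in range(max(1, len(words) - n + 1)):
        window = words[i:i + n]
        score = SequenceMatcher(None, target, window).ratio()
        if best is None or score > best[2]:
            start = asr_words[i][1]
            end = asr_words[i + len(window) - 1][2]
            best = (start, end, score)
    return best

# Hypothetical ASR output for a clip containing more than one utterance.
asr_words = [
    ("hey", 0.0, 0.3), ("joey", 0.3, 0.7),
    ("how", 1.0, 1.2), ("you", 1.2, 1.4), ("doing", 1.4, 1.9),
    ("fine", 2.5, 2.9),
]
result = realign("How you doing?", asr_words)
```

In a full pipeline, the resulting time span would be used to cut the video, and an active speaker detection model would then be run on the cut to decide which visible face belongs to the person speaking. That second step is not sketched here.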

updated: Thu Dec 08 2022 11:00:05 GMT+0000 (UTC)

published: Wed Nov 23 2022 09:57:17 GMT+0000 (UTC)

arXiv
