Dual-branch Cross-Patch Attention Learning for Group Affect Recognition

Hongxia Xie; Ming-Xian Lee; Tzu-Jui Chen; Hung-Jen Chen; Hou-I Liu; Hong-Han Shuai; Wen-Huang Cheng

Group affect refers to the subjective emotion evoked by an external stimulus in a group, and it is an important factor shaping group behavior and outcomes. Recognizing group affect involves identifying the important individuals and salient objects in a crowd that can evoke emotions. Most existing methods detect faces and objects with pre-trained detectors and summarize the results into group emotions using specific rules. However, such affective region selection mechanisms are heuristic and susceptible to imperfect face and object detections from the pre-trained detectors. Moreover, faces and objects in group-level images are often contextually related, and how important faces and objects should interact with each other remains an open question. In this work, we incorporate the psychological concept of the Most Important Person (MIP), which represents the most noteworthy face in the crowd and carries affective semantic meaning. We propose the Dual-branch Cross-Patch Attention Transformer (DCAT), which takes the global image and the MIP together as inputs. Specifically, we first learn the informative facial regions produced by the MIP and the global context separately. Then, the Cross-Patch Attention module fuses the features of the MIP and the global context so that they complement each other. With over 10x fewer parameters, the proposed DCAT outperforms state-of-the-art methods on two group valence prediction datasets, GAF 3.0 and GroupEmoW. Moreover, the proposed model can be transferred to another group affect task, group cohesion, where it shows comparable results.
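
The fusion step described in the abstract can be pictured as two sets of patch tokens attending to each other. The sketch below is a generic cross-attention illustration in plain Python, not the authors' Cross-Patch Attention module: the abstract does not give its internals, so the function names (`cross_attend`, `fuse`), the absence of learned projections, the residual connection, and the toy token values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query token attend over the tokens of the other branch.

    Assumption: no learned query/key/value projections are applied;
    the raw tokens of the other branch serve as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the other branch's tokens.
        attended = [
            sum(w * v[i] for w, v in zip(weights, keys_values))
            for i in range(dim)
        ]
        # Residual connection (assumed) keeps the query branch's own features.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def fuse(global_tokens, mip_tokens):
    """Hypothetical two-way fusion: each branch queries the other."""
    fused_global = cross_attend(global_tokens, mip_tokens)
    fused_mip = cross_attend(mip_tokens, global_tokens)
    return fused_global, fused_mip


# Toy 2-dimensional patch tokens (made-up values, not real features).
global_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mip_tokens = [[0.5, 0.5], [1.0, 0.0]]
fused_global, fused_mip = fuse(global_tokens, mip_tokens)
```

Each branch keeps its own number of tokens after fusion, so the global branch still yields three tokens and the MIP branch two; only their contents are enriched with information from the other branch.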

updated: Wed Dec 14 2022 06:51:39 GMT+0000 (UTC)

published: Wed Dec 14 2022 06:51:39 GMT+0000 (UTC)

arXiv
