Multi-Granularity Network with Modal Attention for Dense Affective Understanding

Baoming Yan; Lin Wang; Ke Gao; Bo Gao; Xiao Liu; Chao Ban; Jiang Yang; Xiaobo Li

Video affective understanding, which aims to predict the expressions evoked by video content, is valuable for video creation and recommendation. The recent EEV challenge proposes a dense affective understanding task that requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features to better describe the target frame. Specifically, the multi-granularity features are divided into frame-level, clip-level, and video-level features, which correspond to visually salient content, semantic context, and video theme information, respectively. A modal attention fusion module is then designed to fuse the multi-granularity features and emphasize the more affect-relevant modalities. Finally, the fused feature is fed into a Mixture of Experts (MoE) classifier to predict the expressions. With additional model-ensemble post-processing, the proposed method achieves a correlation score of 0.02292 in the EEV challenge.
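
The fusion-then-classify pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the attention module or the classifier, so everything below is an assumption made for illustration: the three granularity features are taken to share one dimension, attention scores come from a dot product with a hypothetical query vector, the experts are plain logistic units, and all weights are random rather than learned. The function names (modal_attention_fuse, moe_predict) are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modal_attention_fuse(features, query):
    """Fuse same-sized feature vectors with softmax attention weights.

    features: dict mapping a granularity name to its feature vector.
    query: hypothetical scoring vector (an assumption, not from the paper).
    Returns the weighted-sum feature and the per-granularity weights.
    """
    names = list(features)
    weights = softmax([dot(features[n], query) for n in names])
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


def moe_predict(x, gate_weights, expert_weights):
    """Mixture-of-experts score for a single expression class.

    A softmax gate mixes the outputs of several logistic experts.
    """
    gate = softmax([dot(g, x) for g in gate_weights])
    experts = [sigmoid(dot(e, x)) for e in expert_weights]
    return sum(g * p for g, p in zip(gate, experts))


def random_vector(rng, dim):
    """Stand-in for learned weights or extracted features."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
dim, num_experts = 4, 3

# Toy frame-, clip-, and video-level features for one target frame.
features = {name: random_vector(rng, dim) for name in ("frame", "clip", "video")}
fused, attention = modal_attention_fuse(features, random_vector(rng, dim))

gate_weights = [random_vector(rng, dim) for _ in range(num_experts)]
expert_weights = [random_vector(rng, dim) for _ in range(num_experts)]
score = moe_predict(fused, gate_weights, expert_weights)

print("attention weights:", attention)
print("predicted expression score:", score)
```

In this sketch the attention weights sum to one, so a granularity that scores higher against the query contributes more to the fused feature; a trained model would learn the query, gate, and expert weights and would predict one score per expression class.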

updated: Fri Jun 18 2021 07:37:06 GMT+0000 (UTC)

published: Fri Jun 18 2021 07:37:06 GMT+0000 (UTC)

arXiv
