Improving Visual-textual Sentiment Analysis by Fusing Expert Features

Junyu Chen; Jie An; Hanjia Lyu; Jiebo Luo

Visual-textual sentiment analysis aims to predict sentiment from an image-text pair. Its main challenge is learning visual features that are effective for sentiment prediction, since input images are often very diverse. To address this challenge, we propose a new method that improves visual-textual sentiment analysis by introducing powerful expert visual features. The proposed method consists of four parts: (1) a visual-textual branch that learns features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders that extract effective visual features, (3) a CLIP branch that implicitly models visual-textual correspondence, and (4) a multimodal feature fusion network, based on either BERT or MLP, that fuses the multimodal features and predicts sentiment. Extensive experiments on three datasets show that our method outperforms existing methods on visual-textual sentiment analysis.
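As a rough illustration of part (4), the sketch below shows how an MLP-based fusion step could work in principle. It is not the authors' implementation. The abstract does not say how the branch features are combined, so concatenation is an assumption, as are the feature dimensions, the hidden size, the three sentiment classes, and the randomly initialized (untrained) weights. The three feature vectors are hypothetical stand-ins for real encoder outputs.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random, untrained weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_predict(branch_features, params):
    """Concatenate per-branch features (an assumption), then apply a two-layer MLP."""
    fused = [v for feats in branch_features for v in feats]
    (w1, b1), (w2, b2) = params
    hidden = relu(linear(fused, w1, b1))
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)

# Hypothetical stand-ins for the outputs of the three branches.
visual_textual = [rng.gauss(0, 1) for _ in range(8)]
visual_expert = [rng.gauss(0, 1) for _ in range(6)]
clip_features = [rng.gauss(0, 1) for _ in range(4)]
branches = [visual_textual, visual_expert, clip_features]

n_in = sum(len(b) for b in branches)
n_classes = 3  # assumed: negative, neutral, positive
params = [init_layer(rng, n_in, 16), init_layer(rng, 16, n_classes)]

probs = fuse_and_predict(branches, params)
print(probs)  # one probability per assumed sentiment class
```

The BERT-based fusion option mentioned in the abstract is not covered here. It would presumably replace the MLP with a transformer over the branch features.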

updated: Wed Nov 23 2022 14:40:51 GMT+0000 (UTC)

published: Wed Nov 23 2022 14:40:51 GMT+0000 (UTC)

arXiv
