Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features

Gokul Karthik Kumar; Karthik Nandakumar

Hateful memes are a growing menace on social media. While the image and its corresponding text in a meme are related, they do not necessarily convey the same meaning when viewed individually. Hence, detecting hateful memes requires careful consideration of both visual and textual information. Multimodal pre-training can be beneficial for this task because it effectively captures the relationship between the image and the text by representing them in a similar feature space. Furthermore, it is essential to model the interactions between the image and text features through intermediate fusion. Most existing methods employ either multimodal pre-training or intermediate fusion, but not both. In this work, we propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a feature interaction matrix (FIM). A simple classifier based on the FIM representation achieves state-of-the-art performance on the Hateful Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the human performance of 82.65. Experiments on other meme datasets such as Propaganda Memes and TamilMemes also demonstrate the generalizability of the proposed approach. Finally, we analyze the interpretability of the FIM representation and show that cross-modal interactions can indeed facilitate the learning of meaningful concepts. The code for this work is available at https://github.com/gokulkarthik/hateclipper.
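To make the idea of a feature interaction matrix concrete, the following minimal Python sketch may help. It is not the authors' implementation (see the repository linked above for that). It assumes the FIM is the outer product of an image feature vector and a text feature vector, uses random vectors in place of real CLIP embeddings, and applies a toy logistic scorer with made-up weights. All function names here (`feature_interaction_matrix`, `flatten_fim`, `diagonal_of_fim`, `logistic_score`) are hypothetical and chosen only for illustration.

```python
import math
import random


def feature_interaction_matrix(image_feat, text_feat):
    """Assumed FIM: outer product, entry [i][j] = image_feat[i] * text_feat[j]."""
    return [[a * b for b in text_feat] for a in image_feat]


def flatten_fim(fim):
    """Use every pairwise interaction as a classifier input (row-major order)."""
    return [value for row in fim for value in row]


def diagonal_of_fim(fim):
    """Keep only same-index interactions, i.e. the element-wise product."""
    return [fim[i][i] for i in range(len(fim))]


def logistic_score(features, weights, bias=0.0):
    """Toy linear classifier with a sigmoid output; weights are placeholders."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Stand-ins for CLIP image/text embeddings (real ones are far larger).
rng = random.Random(0)
dim = 4
image_feat = [rng.gauss(0.0, 1.0) for _ in range(dim)]
text_feat = [rng.gauss(0.0, 1.0) for _ in range(dim)]

fim = feature_interaction_matrix(image_feat, text_feat)
full_features = flatten_fim(fim)          # dim * dim values
diag_features = diagonal_of_fim(fim)      # dim values

# Untrained placeholder weights, only to show the data flow.
weights = [0.1] * len(full_features)
score = logistic_score(full_features, weights)
```

Flattening the matrix keeps all pairwise image-text interactions at a quadratic cost in the feature dimension, while the diagonal keeps only the aligned dimensions. Which variant a real system should use depends on the feature size and the available training data.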

updated: Thu Oct 13 2022 07:20:23 GMT+0000 (UTC)

published: Wed Oct 12 2022 04:34:54 GMT+0000 (UTC)

arXiv
