Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media

Peipei Liu; Gaosheng Wang; Hong Li; Jie Liu; Yimo Ren; Hongsong Zhu; Limin Sun

Named Entity Recognition (NER) on social media is the task of discovering and classifying entities in unstructured free-form content, and it plays an important role in applications such as intention understanding and user recommendation. As social media posts are increasingly multimodal, Multimodal Named Entity Recognition (MNER) for text with an accompanying image is attracting growing attention, since some textual components can only be understood in combination with visual information. However, existing approaches have two drawbacks: 1) The meanings of the text and its accompanying image do not always match, so the text still plays the major role. Yet social media posts are usually shorter and more informal than other content, which easily leads to incomplete semantic descriptions and data sparsity. 2) Although visual representations of whole images or objects are already used, existing methods ignore either the fine-grained semantic correspondence between objects in images and words in text, or the fact that some images contain misleading objects or no objects at all. In this work, we address both problems by introducing multi-granularity cross-modality representation learning. For the first problem, we enhance the representation of each word in the text through semantic augmentation. For the second, we perform cross-modality semantic interaction between text and vision at different visual granularities to obtain the most effective multimodal guidance representation for every word. Experiments show that the proposed approach achieves SOTA or near-SOTA performance on two benchmark datasets of tweets. The code, data, and best-performing models are available at https://github.com/LiuPeiP-CS/IIE4MNER
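
The abstract does not specify the architecture, so the following is only a minimal sketch of the general idea of letting each word draw on visual features at more than one granularity. It is not the authors' model. It assumes plain scaled dot-product attention, and all names (`attend`, `visual_guidance`, `objects`, `image`) are hypothetical. Each word first attends over object-level features (fine granularity), then weighs that result against a single image-level feature (coarse granularity). When an image has no detected objects, only the image-level feature is used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention of one query over `keys`.

    The keys double as the values, so the result is a convex
    combination of the key vectors.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]


def visual_guidance(word, objects, image):
    """Hypothetical per-word visual guidance vector (illustration only).

    word    -- feature vector of one word
    objects -- list of object-level feature vectors (may be empty)
    image   -- one image-level feature vector
    """
    candidates = [image]  # the coarse granularity is always available
    if objects:
        # Fine granularity: the word attends over the detected objects.
        candidates.append(attend(word, objects))
    # The word then weighs the available granularities against each other.
    return attend(word, candidates)


if __name__ == "__main__":
    word = [1.0, 0.0]
    objects = [[1.0, 0.0], [0.0, 1.0]]
    image = [0.0, 0.0]
    print(visual_guidance(word, objects, image))
    print(visual_guidance(word, [], image))  # no objects: falls back to image
```

In this sketch, the fallback for images without objects is a simple branch. A trained model would more plausibly learn projections and gating for this case, which the sketch leaves out.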

updated: Sun Nov 20 2022 02:02:24 GMT+0000 (UTC)

published: Wed Oct 19 2022 15:14:55 GMT+0000 (UTC)

arXiv
