Flat Multi-modal Interaction Transformer for Named Entity Recognition

Junyu Lu; Dixiang Zhang; Pingjian Zhang

Multi-modal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. However, in dominant MNER approaches, the interaction between modalities is usually carried out by alternating self-attention and cross-attention, or by over-relying on gating mechanisms, which results in imprecise and biased correspondence between the fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first use noun phrases in sentences and general domain words to obtain visual cues. Then, we transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match the different modalities in the Transformer. Meanwhile, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our method achieves new state-of-the-art performance on two benchmark datasets.

updated: Tue Aug 23 2022 15:25:44 GMT+0000 (UTC)

published: Tue Aug 23 2022 15:25:44 GMT+0000 (UTC)

arXiv
