On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Lei Li; Xiang Chen; Shuofei Qiao; Feiyu Xiong; Huajun Chen; Ningyu Zhang

Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we conduct an in-depth empirical analysis showing that inaccurate information in the visual scene graph leads to poor modal alignment weights, which in turn degrades performance. Moreover, visual shuffle experiments illustrate that current approaches may not take full advantage of visual information. Based on these observations, we propose a strong baseline with implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate that our method achieves better performance. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.

updated: Mon Nov 14 2022 16:39:24 GMT+0000 (UTC)

published: Mon Nov 14 2022 16:39:24 GMT+0000 (UTC)

arXiv
