Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph

Wentian Zhao; Yao Hu; Heda Wang; Xinxiao Wu; Jiebo Luo

Entity-aware image captioning aims to describe the named entities and events related to an image by drawing on background knowledge from the associated article. The task remains challenging because the long-tail distribution of named entities makes it difficult to learn associations between named entities and visual cues. Furthermore, the complexity of the article makes it difficult to extract the fine-grained relationships between entities that are needed to generate informative event descriptions of the image. To tackle these challenges, we propose a novel approach that constructs a multi-modal knowledge graph to associate visual objects with named entities and, at the same time, capture the relationships between entities with the help of external knowledge collected from the web. Specifically, we build a text sub-graph by extracting named entities and their relationships from the article, and an image sub-graph by detecting the objects in the image. To connect these two sub-graphs, we propose a cross-modal entity matching module trained on a knowledge base that contains Wikipedia entries and their corresponding images. Finally, the multi-modal knowledge graph is integrated into the captioning model via a graph attention mechanism. Extensive experiments on both the GoodNews and NYTimes800k datasets demonstrate the effectiveness of our method.
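The pipeline described in the abstract (two sub-graphs, a cross-modal link between them, then attention over the graph nodes) can be illustrated with a small toy sketch. This is not the authors' implementation: the trained cross-modal entity matching module is replaced here by a plain cosine-similarity threshold, the graph attention mechanism by single-query dot-product attention, and every name (`build_multimodal_graph`, `attend`, the `"matches"` relation, the `threshold` parameter, the toy embeddings) is a hypothetical chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_multimodal_graph(text_nodes, text_edges, image_nodes, threshold=0.8):
    """Join a text sub-graph and an image sub-graph into one edge list.

    text_nodes:  dict mapping entity name -> embedding (assumed given)
    text_edges:  list of (head, relation, tail) triples from the article
    image_nodes: dict mapping detected object id -> embedding (assumed given)

    ASSUMPTION: a cosine threshold stands in for the learned
    cross-modal entity matching module mentioned in the abstract.
    """
    edges = list(text_edges)
    for obj, obj_vec in image_nodes.items():
        # Pick the most similar named entity for this visual object.
        best = max(text_nodes, key=lambda n: cosine(obj_vec, text_nodes[n]), default=None)
        if best is not None and cosine(obj_vec, text_nodes[best]) >= threshold:
            edges.append((obj, "matches", best))
    return edges


def attend(query, node_vecs):
    """Softmax dot-product attention of one query over node embeddings.

    ASSUMPTION: a simplified stand-in for the graph attention mechanism;
    it ignores edge structure and uses no learned parameters.
    Returns (weights, context_vector).
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in node_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * vec[i] for w, vec in zip(weights, node_vecs))
               for i in range(len(query))]
    return weights, context


# Toy usage with made-up 2-d embeddings.
text_nodes = {"Obama": [1.0, 0.0], "White House": [0.0, 1.0]}
text_edges = [("Obama", "works_at", "White House")]
image_nodes = {"person_0": [0.9, 0.1]}

graph = build_multimodal_graph(text_nodes, text_edges, image_nodes)
weights, context = attend([1.0, 0.0], list(text_nodes.values()))
```

In this toy setup the detected object `person_0` is linked to the entity whose embedding it most resembles, and the attention weights indicate which graph nodes a caption decoder state (the query) would draw on most at a given step.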

updated: Mon Jul 26 2021 05:50:41 GMT+0000 (UTC)

published: Mon Jul 26 2021 05:50:41 GMT+0000 (UTC)

arXiv
