Visual News: Benchmark and Challenges in News Image Captioning

Fuxiao Liu; Yinghan Wang; Tianlu Wang; Vicente Ordonez

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method uses far fewer parameters than competing methods while achieving slightly better prediction results. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.

updated: Mon Sep 13 2021 18:53:35 GMT+0000 (UTC)

published: Thu Oct 08 2020 03:07:00 GMT+0000 (UTC)

arXiv
