Scene-based Factored Attention for Image Captioning

Image captioning has attracted ever-increasing research attention in the multimedia community. Most state-of-the-art approaches rely on an encoder-decoder framework with attention mechanisms, which has achieved remarkable progress. However, such a framework does not take scene concepts into account when attending to visual information, which leads to sentence bias in caption generation and degrades performance accordingly. We argue that scene concepts capture higher-level visual semantics and serve as an important cue for describing images. In this paper, we propose a novel scene-based factored attention module for image captioning. Specifically, the proposed module first explicitly embeds the scene concepts into factored weights and attends to the visual information extracted from the input image. Then, an adaptive LSTM is used to generate captions for specific scene types. Experimental results on the Microsoft COCO benchmark show that the proposed scene-based attention module substantially improves model performance and outperforms state-of-the-art approaches under various evaluation metrics.
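The abstract does not give the module's equations, so the following is only an illustrative sketch of what "embedding scene concepts into factored weights" could look like, not the authors' formulation. It assumes a scene-dependent weight of the form W(s) = U diag(S s) V applied to the decoder state inside a standard additive attention. All names (`scene_factored_attention`, `init_params`, `Wf`, `U`, `S`, `V`, `w`) and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def init_params(d, h, c, a, f, seed=0):
    # Hypothetical parameter shapes: d = feature dim, h = decoder state dim,
    # c = number of scene types, a = attention dim, f = number of factors.
    rng = np.random.default_rng(seed)
    return {
        "Wf": rng.normal(scale=0.1, size=(a, d)),  # projects region features
        "U": rng.normal(scale=0.1, size=(a, f)),   # left factor
        "S": rng.normal(scale=0.1, size=(f, c)),   # scene embedding factor
        "V": rng.normal(scale=0.1, size=(f, h)),   # right factor
        "w": rng.normal(scale=0.1, size=(a,)),     # scoring vector
    }


def scene_factored_attention(features, hidden, scene_probs, params):
    """Additive attention whose query weight is modulated by the scene.

    features:    (k, d) region features from the image encoder
    hidden:      (h,)   current decoder state
    scene_probs: (c,)   scene distribution (assumed to come from a classifier)
    Returns attention weights (k,) and the attended context vector (d,).
    """
    Wf, U, S, V, w = (params[name] for name in ("Wf", "U", "S", "V", "w"))
    gate = S @ scene_probs               # (f,) scene-specific factor gains
    query = U @ (gate * (V @ hidden))    # equals U @ diag(gate) @ V @ hidden
    scores = np.tanh(features @ Wf.T + query) @ w  # (k,) one score per region
    alpha = softmax(scores)
    context = alpha @ features           # (d,) weighted sum of region features
    return alpha, context


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    params = init_params(d=8, h=6, c=3, a=5, f=4)
    feats = rng.normal(size=(7, 8))
    state = rng.normal(size=6)
    scene = np.array([0.0, 1.0, 0.0])    # one-hot scene type, for illustration
    alpha, context = scene_factored_attention(feats, state, scene, params)
    print(alpha.shape, context.shape)
```

In this sketch the scene only rescales the f shared factors, so adding a scene type adds one column to `S` rather than a full weight matrix. Whether the paper uses this particular factorization is not stated in the abstract.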

updated: Mon Sep 02 2019 16:11:16 GMT+0000 (UTC)

published: Wed Aug 07 2019 13:43:25 GMT+0000 (UTC)

arXiv

