Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Wei Wei; Ling Cheng; Xianling Mao; Guangyou Zhou; Feida Zhu

Automatic image caption generation has recently become an important focus of work on multimodal translation tasks. Existing approaches can be roughly divided into two classes, top-down and bottom-up: the former transfers image information (the visual-level features) directly into a caption, while the latter uses extracted words (the semantic-level attributes) to generate a description. However, previous methods are typically either based on a one-stage decoder or use only part of the visual-level or semantic-level information for caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture, called Stack-VS, for rich, fine-grained image caption generation. It combines bottom-up and top-down attention models to handle both the visual-level and semantic-level information of an input image effectively. Specifically, we propose a novel stack decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize attention weights on both visual-level feature vectors and semantic-level attribute embeddings in order to generate a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics: the improvements on BLEU-4/CIDEr/SPICE scores are 0.372, 1.226 and 0.216, respectively, compared with the state of the art.
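The stacked decoder idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes dot-product soft attention, toy three-dimensional features, and a plain tanh update as a simplified stand-in for each of the two LSTM layers. All function names (`attend`, `decoder_cell`, `stacked_decode_step`) and all numbers are hypothetical. It only shows how each stage can recompute attention weights over visual features and attribute embeddings using the hidden state from the previous stage.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product soft attention (an assumption, not the paper's scoring function).

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, context


def decoder_cell(hidden, visual_feats, attr_embeds):
    """One decoder stage: visual attention, then semantic attention.

    Each tanh update is a simplified stand-in for an LSTM layer.
    """
    v_weights, v_ctx = attend(hidden, visual_feats)
    hidden = [math.tanh(h + c) for h, c in zip(hidden, v_ctx)]
    s_weights, s_ctx = attend(hidden, attr_embeds)
    hidden = [math.tanh(h + c) for h, c in zip(hidden, s_ctx)]
    return hidden, v_weights, s_weights


def stacked_decode_step(hidden, visual_feats, attr_embeds, num_stages=3):
    """Run the stages in sequence; each stage re-weights both feature sets."""
    history = []
    for _ in range(num_stages):
        hidden, v_w, s_w = decoder_cell(hidden, visual_feats, attr_embeds)
        history.append((v_w, s_w))
    return hidden, history


# Toy inputs (hypothetical): three image-region features, two attribute embeddings.
visual_feats = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.3, 0.9]]
attr_embeds = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]

hidden, history = stacked_decode_step([0.1, 0.2, 0.3], visual_feats, attr_embeds)
for stage, (v_w, s_w) in enumerate(history, start=1):
    print(stage, [round(w, 3) for w in v_w], [round(w, 3) for w in s_w])
```

Printing the per-stage weights makes it visible that later stages attend differently from the first one, which is the sense in which a multi-stage decoder can refine a one-stage result. A real model would use learned projections and LSTM cells in place of the fixed updates here.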

updated: Thu Sep 05 2019 15:41:53 GMT+0000 (UTC)

published: Thu Sep 05 2019 15:41:53 GMT+0000 (UTC)

arXiv
