Dual Attention on Pyramid Feature Maps for Image Captioning

Litao Yu; Jian Zhang; Qiang Wu

Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose applying dual attention to pyramid image feature maps to fully explore visual-semantic correlations and improve the quality of generated sentences. Specifically, by fully considering the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. In addition, the contextual information helps re-calibrate the importance of feature components by learning channel-wise dependencies, which improves the discriminative power of the visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K and MS COCO, and achieved impressive results in generating descriptive and fluent natural sentences from images. Using either convolutional visual features or more informative bottom-up attention features, our composite captioning model achieves very promising performance in single-model mode. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning modules to further improve performance.
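The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no equations, so the additive (tanh) scoring form, the sigmoid channel gate, the mean fusion across pyramid levels, and every name and shape below (`init_params`, `W_f`, `W_h`, `w`, `W_c`, `W_hc`) are hypothetical choices made only to show how a hidden state could condition both spatial attention and channel-wise re-calibration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(C, D, A, rng, scale=0.1):
    # Hypothetical parameter set for one pyramid level.
    # C: feature channels, D: hidden-state size, A: attention size.
    return {
        "W_f": scale * rng.standard_normal((C, A)),
        "W_h": scale * rng.standard_normal((D, A)),
        "w": scale * rng.standard_normal(A),
        "W_c": scale * rng.standard_normal((C, C)),
        "W_hc": scale * rng.standard_normal((D, C)),
    }

def spatial_attention(feats, h, W_f, W_h, w):
    # feats: (N, C) region features, h: (D,) RNN hidden state.
    # Assumed additive scoring: one scalar score per region.
    scores = np.tanh(feats @ W_f + h @ W_h) @ w   # (N,)
    alpha = softmax(scores)                       # weights sum to 1
    return alpha, alpha @ feats                   # (N,), (C,)

def channel_recalibration(feats, h, W_c, W_hc):
    # Assumed gate: pool over regions, mix with the hidden state,
    # then rescale each channel by a value in (0, 1).
    squeezed = feats.mean(axis=0)                 # (C,)
    gate = sigmoid(squeezed @ W_c + h @ W_hc)     # (C,)
    return feats * gate                           # (N, C)

def dual_attention_on_pyramid(pyramid, h, params):
    # pyramid: list of (N_k, C) arrays, one per scale.
    # Each level is re-calibrated channel-wise, then attended spatially;
    # averaging the per-level contexts is an assumption of this sketch.
    contexts = []
    for feats, p in zip(pyramid, params):
        recal = channel_recalibration(feats, h, p["W_c"], p["W_hc"])
        _, ctx = spatial_attention(recal, h, p["W_f"], p["W_h"], p["w"])
        contexts.append(ctx)
    return np.mean(contexts, axis=0)              # (C,)
```

In a captioning loop, the returned context vector would typically be fed to the RNN controller at each decoding step together with the previous word embedding. How the paper actually fuses the pyramid levels and orders the two attention steps may differ from this sketch.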

updated: Mon Nov 02 2020 23:42:34 GMT+0000 (UTC)

published: Mon Nov 02 2020 23:42:34 GMT+0000 (UTC)

arXiv
