Learning to Select: A Fully Attentive Approach for Novel Object Captioning

Marco Cagrandi; Marcella Cornia; Matteo Stefanini; Lorenzo Baraldi; Rita Cucchiara

Image captioning models have recently shown impressive results on standard datasets. Moving to real-life scenarios, however, remains a challenge because of the larger variety of visual concepts that existing training sets do not cover. For this reason, novel object captioning (NOC) has recently emerged as a paradigm for testing captioning models on objects that are unseen during training. In this paper, we present a novel approach for NOC that learns to select the most relevant objects in an image, regardless of whether they appear in the training set, and to constrain the generative process of a language model accordingly. Our architecture is fully attentive and end-to-end trainable, even when incorporating constraints. We perform experiments on the held-out COCO dataset, where we demonstrate improvements over the state of the art in terms of both adaptability to novel objects and caption quality.

updated: Wed Jun 02 2021 19:11:21 GMT+0000 (UTC)

published: Wed Jun 02 2021 19:11:21 GMT+0000 (UTC)

arXiv
