Image Captioning based on Feature Refinement and Reflective Decoding

Ghadah Alabduljabbar; Hafida Benhidour; Said Kerrache

Image captioning is the task of automatically generating a natural-language description of an image. It is a significant challenge in image understanding because it requires recognizing not only the salient objects in an image but also their attributes and the way they interact. The system must then generate a syntactically and semantically correct caption that describes the image content. With the significant progress of deep learning models and their ability to encode large sets of images effectively and generate correct sentences, several neural captioning approaches have been proposed recently, each aiming for better accuracy and caption quality. This paper introduces an encoder-decoder image captioning system in which the encoder extracts spatial features from the image using ResNet-101. This stage is followed by a refining model, which uses an attention-on-attention mechanism to extract the visual features of the target image objects and then determine their interactions. The decoder consists of an attention-based recurrent module and a reflective attention module, which collaboratively apply attention to the visual and textual features to enhance the decoder's ability to model long-term sequential dependencies. Extensive experiments on Flickr30K show the effectiveness of the proposed approach and the high quality of the generated captions.
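
The attention-on-attention refinement step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It follows the commonly used formulation in which an attended vector is concatenated with its query, and a sigmoid gate then filters a linear "information" vector. The abstract does not give the paper's actual configuration, so everything here is an assumption made for illustration: the function and parameter names (`attention_on_attention`, `W_info`, `W_gate`), the single-head scaled dot-product attention, the feature size, and the random array standing in for ResNet-101 spatial features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical parameter layout: both projections map the
    # concatenation [query; attended] (size 2d) back to size d.
    scale = 1.0 / np.sqrt(2 * d)
    return {
        "W_info": rng.standard_normal((2 * d, d)) * scale,
        "b_info": np.zeros(d),
        "W_gate": rng.standard_normal((2 * d, d)) * scale,
        "b_gate": np.zeros(d),
    }


def attention_on_attention(q, k, v, params):
    """Gate the result of dot-product attention with the query.

    q: (n_q, d) queries; k, v: (n_kv, d) keys and values.
    Returns an (n_q, d) array.
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention (single head).
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    attended = weights @ v
    # Concatenate each query with its attended vector.
    qa = np.concatenate([q, attended], axis=-1)
    # Information vector and sigmoid gate, both linear in [q; attended].
    info = qa @ params["W_info"] + params["b_info"]
    gate = sigmoid(qa @ params["W_gate"] + params["b_gate"])
    # Element-wise gating keeps only the part of `info` the gate lets through.
    return gate * info


rng = np.random.default_rng(0)
d = 8
# Random stand-in for 5 spatial feature vectors from an image encoder.
feats = rng.standard_normal((5, d))
params = init_params(d, rng)
# Self-attention over the features: queries, keys, and values are the same set.
refined = attention_on_attention(feats, feats, feats, params)
```

Here `refined` has the same shape as `feats`, so a block like this could in principle be stacked or followed by a decoder. The reflective attention module and the recurrent decoder are not sketched because the abstract does not describe them in enough detail.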

updated: Mon Jul 25 2022 10:30:28 GMT+0000 (UTC)

published: Thu Jun 16 2022 07:56:28 GMT+0000 (UTC)

arXiv
