Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network

Md Aminul Haque Palash; MD Abdullah Al Nasim; Sourav Saha; Faria Afrin; Raisa Mallik; Sathishkumar Samiappan

Automatic image captioning is the ongoing effort to generate syntactically correct and contextually accurate natural-language descriptions of an image. Existing Bengali Image Captioning (BIC) research has relied on an encoder-decoder structure that takes abstract image feature vectors as the encoder's input. We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 model as the image encoder for feature extraction. Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, when paired with image features, produces accurate and diverse captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all existing Bengali Image Captioning work and sets a new benchmark, scoring 0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337 on METEOR.
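For readers unfamiliar with the BLEU-1, BLEU-2, and BLEU-3 figures quoted above, the following is a minimal sketch of cumulative sentence-level BLEU-n: clipped n-gram precision with uniform weights and a brevity penalty. It is an illustration only, not the evaluation script used in the paper. The function names `ngrams` and `bleu_n` are hypothetical, the inputs are assumed to be pre-tokenized, and no smoothing is applied, so published scores computed at corpus level or with a different tokenizer will not match it exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu_n(candidate, references, max_n):
    """Cumulative sentence-level BLEU up to order max_n (hypothetical helper).

    candidate  -- list of tokens in the generated caption
    references -- list of token lists (one per reference caption)
    Uses uniform weights 1/max_n and no smoothing.
    """
    if not candidate:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens

        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no overlap at this order; unsmoothed BLEU is zero

        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return brevity_penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    # Toy English tokens, assumed purely for illustration.
    reference = "a boy is playing football in the field".split()
    generated = "a boy is playing football".split()
    for order in (1, 2, 3):
        print(f"BLEU-{order}: {bleu_n(generated, [reference], order):.3f}")
```

Under this definition, a caption identical to its reference scores 1.0 at every order, while a caption that is a correct but shorter prefix of the reference is penalized only through the brevity penalty.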

updated: Sun Oct 24 2021 13:33:23 GMT+0000 (UTC)

published: Sun Oct 24 2021 13:33:23 GMT+0000 (UTC)

arXiv
