Empirical Analysis of Image Caption Generation using Deep Learning

Aditya Bhattacharya; Eshwar Shamanna Girishekar; Padmakar Anil Deshpande

Automated image captioning is an application of deep learning that combines computer vision and natural language processing, and it is typically performed with encoder-decoder architectures. In this project, we implemented and experimented with several multimodal image captioning networks, exploring ResNet101-, DenseNet121- and VGG19-based CNN encoders paired with attention-based LSTM decoders. We studied the effect of beam size and of pretrained word embeddings, and compared these variants against a baseline CNN encoder and RNN decoder architecture. The goal is to analyze the performance of each approach using evaluation metrics including BLEU, CIDEr, ROUGE and METEOR. We also explored model explainability using Visual Attention Maps (VAM), which highlight the parts of an image that contribute most to predicting each word of the generated caption.
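To make the role of beam size concrete, the following is a minimal sketch of beam search over a toy next-word distribution. It is not the authors' implementation: the function `beam_search`, the stand-in `toy_model`, and its probabilities are hypothetical and chosen only to show how a wider beam can recover a caption that greedy decoding (beam size 1) misses. In a real captioning model, the next-word log-probabilities would come from the decoder conditioned on the image features.

```python
import math

def toy_model(prefix):
    """Hypothetical stand-in for a decoder step: maps a prefix to next-word log-probs."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.55, "cat": 0.45},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    # Any prefix not in the table ends the caption with certainty.
    probs = table.get(prefix, {"<end>": 1.0})
    return {word: math.log(p) for word, p in probs.items()}

def beam_search(step_log_probs, beam_size, max_len, end_token):
    """Return the (sequence, log-probability) pair with the highest score found."""
    beams = [((), 0.0)]   # unfinished hypotheses: (word tuple, summed log-prob)
    finished = []         # hypotheses that have emitted end_token
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next word.
        candidates = [
            (prefix + (word,), score + logp)
            for prefix, score in beams
            for word, logp in step_log_probs(prefix).items()
        ]
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_caption, greedy_score = beam_search(toy_model, 1, 3, "<end>")
wide_caption, wide_score = beam_search(toy_model, 2, 3, "<end>")
print(greedy_caption, greedy_score)
print(wide_caption, wide_score)
```

With beam size 1 the search commits to the most probable first word and cannot revisit that choice, whereas beam size 2 keeps the second-best first word alive long enough for its stronger continuation to win. Larger beams cost more decoder evaluations per step, which is the trade-off a beam-size study measures.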

updated: Sat May 22 2021 15:17:21 GMT+0000 (UTC)

published: Fri May 14 2021 05:38:13 GMT+0000 (UTC)

arXiv
