Bornon: Bengali Image Captioning with Transformer-based Deep learning approach

Faisal Muhammad Shah; Mayeesha Humaira; Md Abidur Rahman Khan Jim; Amit Saha Ami; Shimul Paul

Image captioning with an encoder-decoder approach, in which a CNN serves as the encoder and a sequence generator such as an RNN serves as the decoder, has proven very effective. However, this method has a drawback: the sequence must be processed in order. To overcome this drawback, some researchers have used the Transformer model to generate captions from images using English datasets. However, none of them generated captions in Bengali using the Transformer model. We therefore used three different Bengali datasets to generate Bengali captions from images with the Transformer model. Additionally, we compared the performance of the Transformer-based model with a visual attention-based encoder-decoder approach. Finally, we compared the results of the Transformer-based model with those of other models that employed different Bengali image captioning datasets.
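
The contrast between in-order processing and attention can be illustrated with a toy sketch. This is not the authors' model: the function names (`rnn_states`, `attend`, `self_attention`), the scalar weights, and the identity query/key/value projections are all assumptions chosen for brevity. It shows only that a recurrent state depends on the previous step, while each self-attention output depends on the inputs alone and could therefore be computed independently of the other positions.

```python
import math


def rnn_states(xs, w_in=0.5, w_rec=0.5):
    """Toy scalar recurrence: step t cannot start until step t-1 is done."""
    h = 0.0
    states = []
    for x in xs:
        # The new state depends on the previous state h.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def self_attention(xs):
    """Each position attends over all inputs; no output depends on another
    output, so the per-position calls are independent of one another."""
    return [attend(x, xs, xs) for x in xs]


if __name__ == "__main__":
    print(rnn_states([1.0, 2.0, 3.0]))
    print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

A real Transformer adds learned projections, multiple heads, positional encodings, and feed-forward layers, none of which are modeled here.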

updated: Sat Sep 11 2021 08:29:26 GMT+0000 (UTC)

published: Sat Sep 11 2021 08:29:26 GMT+0000 (UTC)

arXiv
