An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU)

Rana Adnan Ahmad; Muhammad Azhar; Hina Sattar

Image captioning with the encoder-decoder framework has advanced tremendously over the last decade, with a CNN mainly used as the encoder and an LSTM as the decoder. Despite impressive accuracy on simple images, this approach is inefficient in terms of time and space complexity. Moreover, for complex images containing many objects and much information, the performance of the CNN-LSTM pair degrades exponentially because it lacks a semantic understanding of the scenes depicted. To address these issues, we present a CNN-GRU encoder-decoder framework with a caption-to-image reconstructor that accounts for both semantic context and time complexity. Using the hidden states of the decoder, the input image and its similar semantic representations are reconstructed, and the reconstruction scores from the semantic reconstructor are used together with the likelihood during model training to assess the quality of the generated caption. As a result, the decoder receives improved semantic information, which enhances caption generation. During model testing, the reconstruction score and the log-likelihood can also be combined to choose the most appropriate caption. The proposed model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
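The test-time selection step described in the abstract, combining a reconstruction score with the log-likelihood to pick a caption, can be illustrated with a minimal sketch. The abstract does not state how the two terms are combined, so the weighted sum below, the `weight` parameter, the `Candidate` class, and the convention that a higher reconstruction score means a better reconstruction are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate caption with its two scores (hypothetical container)."""
    caption: str
    log_likelihood: float        # sum of token log-probabilities from the decoder
    reconstruction_score: float  # assumed: higher means the image is reconstructed better


def combined_score(candidate: Candidate, weight: float = 1.0) -> float:
    """Assumed combination rule: log-likelihood plus a weighted reconstruction score."""
    return candidate.log_likelihood + weight * candidate.reconstruction_score


def select_caption(candidates: List[Candidate], weight: float = 1.0) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    return max(candidates, key=lambda c: combined_score(c, weight))


# Illustrative values only; these are not results from the paper.
candidates = [
    Candidate("a dog on grass", log_likelihood=-4.0, reconstruction_score=-3.0),
    Candidate("a brown dog runs across a lawn", log_likelihood=-4.5, reconstruction_score=-1.0),
]
best = select_caption(candidates, weight=1.0)
print(best.caption)
```

With `weight=0.0` the selection reduces to plain maximum-likelihood reranking, so the weight controls how much the semantic reconstruction term can override the decoder's own preference.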

updated: Fri Jan 06 2023 10:00:06 GMT+0000 (UTC)

published: Fri Jan 06 2023 10:00:06 GMT+0000 (UTC)

arXiv
