New Encoder Learning for Captioning Heavy Rain Images via Semantic Visual Feature Matching

Chang-Hwan Son; Pung-Hwi Ye

Image captioning generates text that describes the scene in an input image. Existing methods have been developed for high-quality images taken in clear weather. However, in bad weather conditions such as heavy rain, snow, and dense fog, poor visibility owing to rain streaks, rain accumulation, and snowflakes seriously degrades image quality. This hinders the extraction of useful visual features and results in deteriorated image captioning performance. To address this practical issue, this study introduces a new encoder for captioning heavy rain images. The central idea is to transform the output features extracted from heavy rain input images into semantic visual features associated with words and sentence context. To achieve this, a target encoder is first trained in an encoder-decoder framework to associate visual features with semantic words. Subsequently, the objects in a heavy rain image are rendered visible by an initial reconstruction subnetwork (IRS) based on a heavy rain model. The IRS is then combined with a semantic visual feature matching subnetwork (SVFMS) to match the output features of the IRS with the semantic visual features of the pretrained target encoder. The proposed encoder is based on the joint learning of the IRS and SVFMS. It is trained in an end-to-end manner and then connected to the pretrained decoder for image captioning. Experiments demonstrate that the proposed encoder can generate semantic visual features associated with words even from heavy rain images, thereby increasing the accuracy of the generated captions.
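The joint learning of the IRS and SVFMS described in the abstract can be illustrated with a toy objective. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss, so the mean-squared-error form, the separate reconstruction term, the `weight` parameter, the choice to feed the clean counterpart image to the target encoder, and the names `mse` and `joint_loss` are all hypothetical. Images and features are plain lists of floats, and the three networks are passed in as ordinary callables.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(
    rainy: Vector,
    clean: Vector,
    irs: Callable[[Vector], Vector],
    svfms: Callable[[Vector], Vector],
    target_encoder: Callable[[Vector], Vector],
    weight: float = 1.0,
) -> float:
    """Toy joint objective (assumed form, not taken from the paper).

    irs            : stand-in for the initial reconstruction subnetwork
    svfms          : stand-in for the feature matching subnetwork
    target_encoder : stand-in for the frozen, pretrained target encoder
    """
    restored = irs(rainy)                      # make objects visible again
    student_features = svfms(restored)         # features to be matched
    teacher_features = target_encoder(clean)   # assumed: clean image as input
    reconstruction = mse(restored, clean)      # assumed reconstruction term
    matching = mse(student_features, teacher_features)
    return reconstruction + weight * matching


# Toy stand-ins: "rain" is modeled as a constant brightness offset.
clean_image = [0.2, 0.4, 0.6]
rainy_image = [0.5, 0.7, 0.9]


def toy_features(image: Vector) -> Vector:
    """Hypothetical feature extractor: mean and maximum intensity."""
    return [sum(image) / len(image), max(image)]


def good_irs(image: Vector) -> Vector:
    return [v - 0.3 for v in image]


def no_irs(image: Vector) -> Vector:
    return list(image)


loss_restored = joint_loss(rainy_image, clean_image, good_irs, toy_features, toy_features)
loss_untouched = joint_loss(rainy_image, clean_image, no_irs, toy_features, toy_features)
```

In this toy setup, an IRS stand-in that removes the offset drives both terms toward zero, while leaving the rainy input untouched yields a larger loss. In a real system the three callables would be neural networks and the encoder would be optimized end to end, as the abstract states.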

updated: Wed Sep 15 2021 15:21:14 GMT+0000 (UTC)

published: Fri May 28 2021 11:40:40 GMT+0000 (UTC)

arXiv
