New Image Captioning Encoder via Semantic Visual Feature Matching for Heavy Rain Images

Chang-Hwan Son; Pung-Hwi Ye

Image captioning generates text that describes the scene in an input image. Existing methods have been developed for high-quality images taken in clear weather. However, in bad weather conditions such as heavy rain, snow, and dense fog, poor visibility owing to rain streaks, rain accumulation, and snowflakes seriously degrades image quality. This hinders the extraction of useful visual features and degrades image captioning performance. To address this practical issue, this study introduces a new encoder for captioning heavy rain images. The central idea is to transform the output features extracted from heavy rain input images into semantic visual features associated with words and sentence context. To achieve this, a target encoder is first trained in an encoder-decoder framework to associate visual features with semantic words. Subsequently, the objects in a heavy rain image are rendered visible by an initial reconstruction subnetwork (IRS) based on a heavy rain model. The IRS is then combined with a semantic visual feature matching subnetwork (SVFMS) that matches the output features of the IRS with the semantic visual features of the pretrained target encoder. The proposed encoder is based on the joint learning of the IRS and SVFMS. It is trained in an end-to-end manner and then connected to the pretrained decoder for image captioning. Experiments demonstrate that the proposed encoder can generate semantic visual features associated with words even from heavy rain images, thereby increasing the accuracy of the generated captions.
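The joint learning of the IRS and SVFMS described in the abstract can be pictured as minimizing two terms at once: one that measures how well the IRS restores the scene, and one that measures how closely the SVFMS output matches the features of the pretrained target encoder. The sketch below is only an illustration under stated assumptions. The abstract does not specify the loss functions or their weighting, so the mean squared error, the `weight` parameter, and the names `mse` and `joint_loss` are all hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(
    restored: Sequence[float],       # IRS output for a heavy rain image (flattened)
    clean: Sequence[float],          # corresponding clean image (flattened)
    matched_feats: Sequence[float],  # SVFMS output features
    target_feats: Sequence[float],   # features from the pretrained target encoder
    weight: float = 1.0,             # assumed trade-off between the two terms
) -> float:
    """Hypothetical joint objective: reconstruction term plus feature-matching term.

    MSE is an assumption made for illustration; the abstract does not state
    which losses the authors use.
    """
    reconstruction = mse(restored, clean)
    feature_matching = mse(matched_feats, target_feats)
    return reconstruction + weight * feature_matching
```

In an actual training loop both terms would be computed on network outputs and backpropagated through the IRS and SVFMS together, which is what end-to-end training refers to here.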

updated: Mon May 31 2021 03:03:21 GMT+0000 (UTC)

published: Fri May 28 2021 11:40:40 GMT+0000 (UTC)

arXiv
