Semantic Disentangling Generalized Zero-Shot Learning

Zhi Chen; Ruihong Qiu; Sen Wang; Zi Huang; Jingjing Li; Zheng Zhang

Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories. Most GZSL methods learn to synthesize CNN visual features for the unseen classes by leveraging the entire semantic information, e.g., tags and attributes, together with the visual features of the seen classes. Within the visual features, we define two types of features, semantic-consistent and semantic-unrelated, which represent the characteristics of images that are annotated in the attributes and the less informative features of images, respectively. In principle, semantic-unrelated information cannot be transferred from seen classes to unseen classes through the semantic-visual relationship, because the corresponding characteristics are not annotated in the semantic information. The foundation of visual feature synthesis is therefore not always solid: the features of the seen classes may contain semantic-unrelated information that interferes with the alignment between the semantic and visual modalities. To address this issue, we propose a novel feature disentangling approach based on an encoder-decoder architecture that factorizes the visual features of images into these two latent feature spaces and extracts the corresponding representations. Furthermore, a relation module is incorporated into this architecture to learn the semantic-visual relationship, while a total correlation penalty is applied to encourage the disentanglement of the two latent representations. The proposed model aims to distill high-quality semantic-consistent representations that capture the intrinsic features of seen images, which are then taken as the generation target for unseen classes. Extensive experiments conducted on seven GZSL benchmark datasets verify the state-of-the-art performance of the proposed approach.
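
The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear layers, the dimensions, the sigmoid relation score, the squared-error losses, and every name below (DisentanglingSketch, encode, decode, relation, losses) are assumptions made purely for illustration. The total correlation penalty is left out because the abstract does not say how it is estimated.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


class DisentanglingSketch:
    """Hypothetical linear encoder-decoder with two latent codes."""

    def __init__(self, x_dim, s_dim, n_dim, a_dim, seed=0):
        rng = random.Random(seed)
        self.s_dim = s_dim
        self.W_enc = rand_matrix(s_dim + n_dim, x_dim, rng)
        self.W_dec = rand_matrix(x_dim, s_dim + n_dim, rng)
        self.w_rel = [rng.uniform(-0.1, 0.1) for _ in range(s_dim + a_dim)]

    def encode(self, x):
        """Split the latent vector into a semantic-consistent code h_s
        and a semantic-unrelated code h_n."""
        h = matvec(self.W_enc, x)
        return h[:self.s_dim], h[self.s_dim:]

    def decode(self, h_s, h_n):
        """Reconstruct the visual feature from both codes."""
        return matvec(self.W_dec, h_s + h_n)

    def relation(self, h_s, a):
        """Score in (0, 1) for how well h_s matches attribute vector a.
        Only h_s is used, so h_n receives no semantic supervision."""
        z = sum(w * v for w, v in zip(self.w_rel, h_s + a))
        return 1.0 / (1.0 + math.exp(-z))


def losses(model, x, class_attrs, label):
    """Return (reconstruction loss, relation loss) for one sample.
    The relation target is 1 for the true class and 0 for the others."""
    h_s, h_n = model.encode(x)
    recon = mse(model.decode(h_s, h_n), x)
    scores = [model.relation(h_s, a) for a in class_attrs]
    targets = [1.0 if i == label else 0.0 for i in range(len(class_attrs))]
    return recon, mse(scores, targets)
```

Under these assumptions, calling losses(model, x, class_attrs, label) on a visual feature x, a list of per-class attribute vectors, and the index of the true class returns the two terms that training would minimize jointly. In the approach the abstract describes, a generator for unseen classes would then target the semantic-consistent code h_s rather than the raw visual feature.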

updated: Wed Jan 20 2021 05:46:21 GMT+0000 (UTC)

published: Wed Jan 20 2021 05:46:21 GMT+0000 (UTC)

arXiv
