HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Shiming Chen; Guo-Sen Xie; Yang Liu; Qinmu Peng; Baigui Sun; Hao Li; Xinge You; Ling Shao

Zero-shot learning (ZSL) tackles the problem of recognizing unseen classes by transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective because the feature representations in the two domains are heterogeneous and intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains through a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we use two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus aligning the visual and semantic feature manifolds more closely. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at https://github.com/shiming-chen/HSVA.
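To make the distribution adaptation step concrete, the following is a minimal sketch of a Wasserstein distance between two latent Gaussians. It assumes diagonal covariances parameterized by a mean and a log-variance, which is common for variational autoencoders but is not stated in the abstract. In that case the squared 2-Wasserstein distance has the closed form ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2. The function name and the toy inputs are hypothetical, and the authors' repository may use a different form (for example the square root of this quantity, or a different reduction over the batch).

```python
import numpy as np


def w2_squared_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    Hypothetical helper, not taken from the HSVA code base.
    All inputs have shape (batch, latent_dim); the result has shape (batch,).
    """
    mu_a, logvar_a, mu_b, logvar_b = (
        np.asarray(x, dtype=float) for x in (mu_a, logvar_a, mu_b, logvar_b)
    )
    # Standard deviations recovered from log-variances (assumed parameterization).
    std_a = np.exp(0.5 * logvar_a)
    std_b = np.exp(0.5 * logvar_b)
    # Closed form for diagonal (commuting) covariances:
    # squared distance between means plus squared distance between std vectors.
    mean_term = np.sum((mu_a - mu_b) ** 2, axis=-1)
    std_term = np.sum((std_a - std_b) ** 2, axis=-1)
    return mean_term + std_term


# Toy example: a "visual" latent N(0, I) and a "semantic" latent N([3, 4], 4I).
visual_mu = np.array([[0.0, 0.0]])
visual_logvar = np.array([[0.0, 0.0]])
semantic_mu = np.array([[3.0, 4.0]])
semantic_logvar = np.log(np.array([[4.0, 4.0]]))

loss = w2_squared_diag_gaussians(
    visual_mu, visual_logvar, semantic_mu, semantic_logvar
).mean()
# Analytically: 3^2 + 4^2 = 25 for the means, plus 2 * (1 - 2)^2 = 2 for the stds.
print(loss)
```

In a training loop this quantity would be averaged over paired visual and semantic samples of the same class and minimized alongside the other objectives; how it is weighted against them is not specified in the abstract.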

updated: Fri Oct 08 2021 07:26:51 GMT+0000 (UTC)

published: Thu Sep 30 2021 14:27:50 GMT+0000 (UTC)

arXiv
