TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

Jinyu Yang; Jingjing Liu; Ning Xu; Junzhou Huang

Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. Despite the recent rapid growth in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the transferability of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior transferability over its CNN-based counterparts by a large margin, and its performance can be further improved by incorporating adversarial adaptation. Nevertheless, directly using CNN-based adaptation strategies fails to take advantage of ViT's intrinsic merits (e.g., attention mechanism and sequential image representation), which play an important role in knowledge transfer. To remedy this, we propose a unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we devise a novel and effective unit, which we term Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT to focus on both transferable and discriminative features. In addition, we leverage discriminative clustering to enhance feature diversity and separation, which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks, and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.
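
The abstract does not spell out how TAM injects transferability into attention, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each patch receives a transferability score equal to the normalized entropy of a hypothetical patch-level domain discriminator's output (a patch the discriminator cannot place in either domain is treated as transferable), and that the class token's attention weights are multiplied by these scores and then renormalized. The function names, the `domain_probs` input, and the renormalization step are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli(p) variable, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def transferability_weighted_attention(query, keys, values, domain_probs):
    """Class-token attention re-weighted by per-patch transferability.

    query        : class-token query vector (list of floats)
    keys, values : one vector per patch
    domain_probs : ASSUMED output of a hypothetical patch-level domain
                   discriminator, P(patch comes from source) in [0, 1]
    """
    d = len(query)
    # Standard scaled dot-product attention scores for the class token.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    attn = softmax(scores)

    # ASSUMPTION: transferability = discriminator entropy, scaled to [0, 1].
    transfer = [binary_entropy(p) / math.log(2.0) for p in domain_probs]

    # Inject transferability into the attention weights, then renormalize
    # (renormalization is a choice made for this sketch).
    weighted = [a * t for a, t in zip(attn, transfer)]
    total = sum(weighted)
    weights = [w / total for w in weighted] if total > 0 else attn

    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Two patches with identical keys: only the transferability scores differ.
weights, output = transferability_weighted_attention(
    query=[1.0, 0.0],
    keys=[[0.5, 0.5], [0.5, 0.5]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    domain_probs=[0.5, 0.99],  # patch 0 is domain-ambiguous, patch 1 is domain-specific
)
```

In this toy call both patches receive equal raw attention, so the re-weighting alone shifts the class token toward the domain-ambiguous patch; a real model would learn the discriminator jointly with the transformer.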

updated: Fri Nov 26 2021 18:24:44 GMT+0000 (UTC)

published: Thu Aug 12 2021 22:37:43 GMT+0000 (UTC)

arXiv
