Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

Jinjing Zhu; Haotian Bai; Lin Wang

Recent work has sought to apply the vision transformer (ViT) to the challenging unsupervised domain adaptation (UDA) task. These methods typically adopt cross-attention in ViT for direct domain alignment. However, because the performance of cross-attention relies heavily on the quality of pseudo labels for target samples, it becomes less effective when the domain gap is large. We address this problem from a game-theoretic perspective with the proposed model, dubbed PMTrans, which bridges the source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds the intermediate domain, i.e., a probability distribution, by learning to sample patches from both domains based on game-theoretical models. In this way, it learns to mix patches from the source and target domains to maximize the cross entropy (CE), while two semi-supervised mixup losses in the feature and label spaces are exploited to minimize it. We thus interpret the UDA process as a min-max CE game with three players, namely the feature extractor, the classifier, and PatchMix, and seek the Nash equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.
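The abstract gives no formulas, so the following is only a rough sketch of what patch-level mixing with attention-based label re-weighting might look like; it is not the authors' implementation. It assumes a fixed mixing ratio `lam` with an independent per-patch coin flip, standing in for the learned sampling distribution that PatchMix uses, and it assumes the label weight is the share of attention mass falling on source patches. The names `patch_mix` and `mix_labels` are hypothetical.

```python
import random


def patch_mix(source_patches, target_patches, attention, lam, rng):
    """Mix two equally long patch sequences position by position.

    ASSUMPTION: each position keeps the source patch with probability
    `lam` (a fixed number here; the paper learns how to sample).
    Returns the mixed sequence, the binary mask (1 = source patch),
    and the attention-weighted share of source patches.
    """
    if not (len(source_patches) == len(target_patches) == len(attention)):
        raise ValueError("inputs must have the same number of patches")

    mask = [1 if rng.random() < lam else 0 for _ in source_patches]
    mixed = [s if m else t
             for s, t, m in zip(source_patches, target_patches, mask)]

    # ASSUMPTION: a patch's importance is its (non-negative) attention
    # score, so the source label weight is the fraction of total
    # attention that lands on patches taken from the source image.
    total = sum(attention)
    source_weight = sum(a for a, m in zip(attention, mask) if m) / total
    return mixed, mask, source_weight


def mix_labels(source_label, target_pseudo_label, source_weight):
    """Blend a source label with a target pseudo label (both as
    probability vectors) using the re-weighted mixing coefficient."""
    return [source_weight * s + (1.0 - source_weight) * t
            for s, t in zip(source_label, target_pseudo_label)]


if __name__ == "__main__":
    rng = random.Random(0)
    src = ["s0", "s1", "s2", "s3"]          # placeholder patch tokens
    tgt = ["t0", "t1", "t2", "t3"]
    attn = [0.1, 0.4, 0.3, 0.2]             # hypothetical attention scores
    mixed, mask, w = patch_mix(src, tgt, attn, lam=0.5, rng=rng)
    print(mixed, mask, w)
    print(mix_labels([1.0, 0.0], [0.2, 0.8], w))
```

In the game described above, the mixing step would play the maximizing role against the feature extractor and classifier; this sketch covers only the mixing and label re-weighting, not the losses or the training loop.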

updated: Thu Mar 23 2023 16:56:01 GMT+0000 (UTC)

published: Thu Mar 23 2023 16:56:01 GMT+0000 (UTC)

arXiv
