TFS-ViT: Token-Level Feature Stylization for Domain Generalization

Mehrdad Noori; Milad Cheraghalikhani; Ali Bahri; Gustavo A. Vargas Hakim; David Osowiechi; Ismail Ben Ayed; Christian Desrosiers

Standard deep learning models such as convolutional neural networks (CNNs) lack the ability to generalize to domains that were not seen during training. This problem stems mainly from the common but often incorrect assumption that source and target data are drawn i.i.d. from the same distribution. Recently, Vision Transformers (ViTs) have shown outstanding performance on a broad range of computer vision tasks. However, very few studies have investigated their ability to generalize to new domains. This paper presents a first Token-level Feature Stylization (TFS-ViT) approach for domain generalization, which improves the performance of ViTs on unseen data by synthesizing new domains. Our approach transforms token features by mixing the normalization statistics of images from different domains. We further improve this approach with a novel strategy for attention-aware stylization, which uses the attention maps of class (CLS) tokens to compute and mix normalization statistics of tokens corresponding to different image regions. The proposed method is flexible with respect to the choice of backbone model and can be easily applied to any ViT-based architecture with a negligible increase in computational complexity. Comprehensive experiments show that our approach achieves state-of-the-art performance on five challenging benchmarks for domain generalization and demonstrate its ability to deal with different types of domain shift. The implementation is available at: https://github.com/Mehrdad-Noori/TFS-ViT_Token-level_Feature_Stylization.
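
The idea of mixing normalization statistics between images can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: statistics are taken per channel across the tokens of one image, they are combined by a convex mix with a weight lam, and the function names channel_stats and stylize_tokens are hypothetical. It is not the authors' implementation, which is available at the repository linked above.

```python
import math


def channel_stats(tokens, eps=1e-6):
    """Per-channel mean and standard deviation across the token axis.

    tokens is a list of token vectors (one list of floats per token).
    eps is added to the variance for numerical stability.
    """
    n = len(tokens)
    columns = list(zip(*tokens))  # one tuple of values per channel
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n + eps)
        for col, m in zip(columns, means)
    ]
    return means, stds


def stylize_tokens(tokens_a, tokens_b, lam):
    """Re-style the tokens of image a with statistics mixed between a and b.

    Assumed formulation (not taken from the paper): normalize each channel
    of a, then rescale and shift it with a convex mix of the two images'
    statistics. lam = 1.0 keeps the statistics of a; lam = 0.0 uses those of b.
    """
    mu_a, sd_a = channel_stats(tokens_a)
    mu_b, sd_b = channel_stats(tokens_b)
    mu_mix = [lam * ma + (1 - lam) * mb for ma, mb in zip(mu_a, mu_b)]
    sd_mix = [lam * sa + (1 - lam) * sb for sa, sb in zip(sd_a, sd_b)]
    return [
        [
            sm * (v - ma) / sa + mm
            for v, ma, sa, sm, mm in zip(tok, mu_a, sd_a, sd_mix, mu_mix)
        ]
        for tok in tokens_a
    ]


# Toy example: two tokens with two channels each, from two different images.
tokens_a = [[1.0, 10.0], [3.0, 14.0]]
tokens_b = [[0.0, 0.0], [4.0, 2.0]]
stylized = stylize_tokens(tokens_a, tokens_b, lam=0.5)
```

In this sketch the relative arrangement of the tokens of image a is preserved, while the per-channel mean and spread move toward those of image b as lam decreases. In practice lam would typically be sampled at random during training so that each step sees a different synthesized style; that sampling choice is likewise an assumption here.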

updated: Wed Mar 29 2023 04:37:04 GMT+0000 (UTC)

published: Tue Mar 28 2023 03:00:28 GMT+0000 (UTC)

arXiv
