Intriguing Properties of Vision Transformers

Muzammal Naseer; Kanchana Ranasinghe; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Ming-Hsuan Yang

Vision transformers (ViTs) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending to image-wide context, conditioned on a given patch, can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations, and domain shifts, e.g., they retain as high as 60% top-1 accuracy on ImageNet even after 80% of the image content is randomly occluded. (b) The robust performance under occlusion is not due to a bias towards local textures; ViTs are significantly less biased towards textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence: accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
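To make the random occlusion in property (a) concrete, the following is a minimal sketch of one way to drop a fixed fraction of image patches. It is an illustration under stated assumptions, not the authors' implementation: the function name `random_patch_drop`, the 16x16 patch grid, the zero fill value, and the single-channel nested-list image are all hypothetical choices that the abstract does not specify.

```python
import random


def random_patch_drop(image, patch_size=16, drop_ratio=0.8, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    image: 2-D list of rows (height x width) of pixel values.
    Returns (occluded_copy, number_of_dropped_patches).
    All parameter defaults are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    cols = width // patch_size
    num_patches = (height // patch_size) * cols
    num_drop = int(round(drop_ratio * num_patches))

    # Pick which patches to occlude, without replacement.
    rng = random.Random(seed)
    dropped = rng.sample(range(num_patches), num_drop)

    # Work on a copy so the input image is left untouched.
    occluded = [list(row) for row in image]
    for index in dropped:
        top = (index // cols) * patch_size
        left = (index % cols) * patch_size
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                occluded[y][x] = 0
    return occluded, num_drop


# Example: a 224x224 image splits into 14 * 14 = 196 patches of 16x16.
image = [[1] * 224 for _ in range(224)]
occluded, num_dropped = random_patch_drop(image)
occluded_fraction = sum(row.count(0) for row in occluded) / (224 * 224)
print(num_dropped, round(occluded_fraction, 3))
```

Because the patch count is an integer, the occluded fraction is only approximately the requested ratio. An evaluation along the lines of property (a) would then pass such occluded images to a trained classifier and measure top-1 accuracy.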

updated: Tue Jun 08 2021 13:21:50 GMT+0000 (UTC)

published: Fri May 21 2021 17:59:18 GMT+0000 (UTC)

arXiv
