What Makes for Hierarchical Vision Transformer?

Yuxin Fang; Xinggang Wang; Rui Wu; Wenyu Liu

Recent studies indicate that hierarchical Vision Transformers with a macro architecture of interleaved non-overlapping window-based self-attention and shifted-window operations can achieve state-of-the-art performance on various visual recognition tasks, challenging the ubiquitous convolutional neural networks (CNNs) that use densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other cross-window communication paradigms, while treating self-attention as the de facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for a hierarchical Vision Transformer to attain strong performance, and we study the effects of different kinds of cross-window communication. To this end, we replace the self-attention layers with embarrassingly simple linear mapping layers. The resulting proof-of-concept architecture, termed LinMapper, achieves very strong performance on ImageNet-1k image recognition. Moreover, we find that LinMapper better leverages the pre-trained representations from image recognition and demonstrates excellent transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapping window under different cross-window communication approaches, all of which give similarly competitive results. Our study reveals that the macro architecture of the Swin model family, rather than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to the dense sliding-window paradigm of the ubiquitous CNN. Code and models will be made publicly available to facilitate future research.
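To make the central idea concrete, the following is a minimal sketch of what replacing window-based self-attention with a linear mapping could look like. The abstract does not give the exact formulation of LinMapper, so everything here is an assumption for illustration: the function names `window_partition` and `linear_token_mixing`, the window size, and the choice to share a single token-mixing matrix across all windows and channels are hypothetical. The point it shows is that the mixing weights are fixed parameters, whereas self-attention would compute them from the tokens themselves.

```python
import random


def window_partition(feature_map, window):
    """Split an H x W grid of channel vectors into non-overlapping tiles.

    Returns a list of windows; each window is a flat, row-major list of
    window * window tokens. (Hypothetical helper, not the authors' code.)
    """
    height, width = len(feature_map), len(feature_map[0])
    assert height % window == 0 and width % window == 0
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tokens = [feature_map[top + i][left + j]
                      for i in range(window) for j in range(window)]
            windows.append(tokens)
    return windows


def linear_token_mixing(tokens, weight):
    """Aggregate tokens inside one window with an input-independent matrix.

    out[p][c] = sum_q weight[p][q] * tokens[q][c]
    Assumption: the same weight is reused for every window and channel.
    """
    channels = len(tokens[0])
    return [[sum(weight[p][q] * tokens[q][c] for q in range(len(tokens)))
             for c in range(channels)]
            for p in range(len(weight))]


# Toy input: a 4 x 4 feature map with 3 channels, split into 2 x 2 windows.
rng = random.Random(0)
height, width, channels, window = 4, 4, 3, 2
fmap = [[[rng.random() for _ in range(channels)] for _ in range(width)]
        for _ in range(height)]

# Stand-in for a learned parameter: a (window*window) x (window*window) matrix.
n_tokens = window * window
weight = [[rng.uniform(-1, 1) for _ in range(n_tokens)]
          for _ in range(n_tokens)]

windows = window_partition(fmap, window)
mixed = [linear_token_mixing(tokens, weight) for tokens in windows]
print(len(windows), len(mixed[0]), len(mixed[0][0]))  # prints "4 4 3"
```

In this sketch no information crosses window boundaries; in a Swin-style macro architecture that role falls to a separate cross-window communication step, such as the shifted-window operation, interleaved between aggregation layers.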

updated: Fri Sep 10 2021 03:04:13 GMT+0000 (UTC)

published: Mon Jul 05 2021 17:59:35 GMT+0000 (UTC)

arXiv
