Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

Xuran Pan; Tianzhu Ye; Zhuofan Xia; Shiji Song; Gao Huang

The self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), enabling adaptive feature extraction from global contexts. However, existing self-attention methods adopt either sparse global attention or window attention to reduce computational complexity, which may compromise local feature learning or be subject to handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use the inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility, and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitute. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in an efficient and flexible manner. Extensive experiments show that our Slide Attention module is applicable to a variety of advanced Vision Transformer models, is compatible with various hardware devices, and achieves consistently improved performance on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/Slide-Transformer.
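
The row-based reading of Im2Col described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the repository linked above for that). All function names here are hypothetical, and the sketch assumes a single attention head, a 3x3 neighborhood, and zero padding at the borders. It gathers each pixel's neighbors in two ways: patch by patch (the column view), and as k*k shifted copies of the whole feature map produced by one-hot depthwise kernels (the row view). It then uses the shifted keys and values for local attention. The deformed shifting module is not modeled; in this sketch it would roughly correspond to replacing the fixed one-hot kernels with learned ones.

```python
import numpy as np

def gather_neighbors_im2col(x, k=3):
    """Column view: for every pixel, cut out its k x k zero-padded patch.
    x: (C, H, W) -> (k*k, C, H, W)."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty((k * k, c, h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + k, j:j + k]            # (C, k, k)
            out[:, :, i, j] = patch.reshape(c, k * k).T
    return out

def depthwise_conv(x, kernel):
    """Depthwise cross-correlation with 'same' zero padding.
    The same k x k kernel is applied to each channel independently."""
    c, h, w = x.shape
    k = kernel.shape[0]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for di in range(k):
        for dj in range(k):
            out += kernel[di, dj] * xp[:, di:di + h, dj:dj + w]
    return out

def gather_neighbors_shift(x, k=3):
    """Row view: each of the k*k offsets is one shifted copy of the whole
    feature map, obtained with a one-hot depthwise kernel."""
    shifted = []
    for idx in range(k * k):
        kernel = np.zeros((k, k), dtype=x.dtype)
        kernel[idx // k, idx % k] = 1.0                # fixed shift, nothing learned
        shifted.append(depthwise_conv(x, kernel))
    return np.stack(shifted)                           # (k*k, C, H, W)

def local_attention(q, k_feat, v_feat, k=3):
    """Single-head local attention over a k x k neighborhood (assumption:
    zero-padded border positions take part in the softmax)."""
    c = q.shape[0]
    keys = gather_neighbors_shift(k_feat, k)
    vals = gather_neighbors_shift(v_feat, k)
    logits = (q[None] * keys).sum(axis=1) / np.sqrt(c)  # (k*k, H, W)
    logits -= logits.max(axis=0, keepdims=True)         # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)
    return (attn[:, None] * vals).sum(axis=0)           # (C, H, W)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 6))
same = np.allclose(gather_neighbors_im2col(x), gather_neighbors_shift(x))
out = local_attention(x, x, x)
```

Here `same` checks that the two gathering strategies produce identical tensors, and `out` has the same shape as the input feature map. The practical appeal of the shift formulation is that it relies only on a standard convolution primitive rather than a custom gather kernel.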

updated: Sun Apr 09 2023 13:37:59 GMT+0000 (UTC)

published: Sun Apr 09 2023 13:37:59 GMT+0000 (UTC)

arXiv
