Global Interaction Modelling in Vision Transformer via Super Tokens

Ammarah Farooq; Muhammad Awais; Sara Ahmed; Josef Kittler

With the popularity of Transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques adopted in recent works. These methods begin with a very small patch size and small embedding dimensions, then perform strided convolution (patch merging) to reduce the feature map size and increase the embedding dimensions, forming a pyramidal design similar to a Convolutional Neural Network (CNN). In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window and captures the rich local details of that window. These tokens are then employed for cross-window communication and global representation learning. Hence, most of the learning in the higher layers is independent of the N image patches, and the class embedding is learned solely from the N/M^2 Super tokens, where M^2 is the window size. In standard image classification on ImageNet-1K, the proposed Super token based transformer (STT-S25) achieves 83.5% accuracy, matching the Swin Transformer (Swin-B) while using roughly half the number of parameters (49M) and delivering double the inference throughput. The proposed Super token transformer offers a lightweight and promising backbone for visual recognition tasks.
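
To make the token bookkeeping in the abstract concrete, the following is a minimal sketch that counts N image patches, M^2 patches per window, and N/M^2 Super tokens, and compares rough query-key pair counts for three attention patterns. The function names and the example settings (224-pixel image, 8-pixel patches, 7x7 windows) are illustrative assumptions and are not taken from the paper; the pair counts are a back-of-the-envelope estimate, not the authors' complexity analysis.

```python
def token_counts(image_size: int, patch_size: int, window: int) -> dict:
    """Count tokens for a square image split into patches and M x M windows.

    All three arguments are hypothetical settings chosen for illustration.
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size        # patches along one image side
    if side % window:
        raise ValueError("patch grid must be divisible by the window side M")
    n_patches = side * side                # N
    per_window = window * window           # M^2
    n_super = n_patches // per_window      # N / M^2, one Super token per window
    return {"N": n_patches, "M2": per_window, "super_tokens": n_super}


def attention_pairs(counts: dict) -> dict:
    """Rough query-key pair counts per layer (an estimate, not a FLOP count)."""
    n, m2, s = counts["N"], counts["M2"], counts["super_tokens"]
    return {
        # every patch attends to every patch
        "global_over_patches": n * n,
        # each window attends within itself: M^2 patches plus its Super token
        "local_windows": s * (m2 + 1) ** 2,
        # Super tokens plus one class token attend to each other
        "super_tokens_plus_class": (s + 1) ** 2,
    }


counts = token_counts(image_size=224, patch_size=8, window=7)
pairs = attention_pairs(counts)
print(counts)  # {'N': 784, 'M2': 49, 'super_tokens': 16}
print(pairs)
```

Under these assumed settings the Super-token layers operate on 16 tokens (plus a class token) instead of 784 patches, which is the source of the reduced cost in the higher layers that the abstract describes.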

updated: Thu Nov 25 2021 16:22:57 GMT+0000 (UTC)

published: Thu Nov 25 2021 16:22:57 GMT+0000 (UTC)

arXiv
