Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Yifan Xu; Zhijie Zhang; Mengdan Zhang; Kekai Sheng; Ke Li; Weiming Dong; Liqing Zhang; Changsheng Xu; Xing Sun

Vision transformers (ViTs) have recently gained explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, a mainstream paradigm for computation reduction is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computation of large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, existing token pruning has two limitations: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression that is commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
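The selection and slow-fast updating steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract only states that tokens are selected by global class attention and then updated along two different computation paths; everything else here is an assumption made for illustration: the function names select_informative and slow_fast_update, the keep parameter, the attention-weighted "representative" token that summarizes the uninformative tokens, and the reuse of that token's residual as the fast update. The slow_block argument stands in for a full transformer block (attention plus feed-forward).

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def select_informative(cls_attention: Sequence[float], keep: int) -> Tuple[List[int], List[int]]:
    """Split token indices into (informative, uninformative) by class attention.

    The `keep` tokens with the highest class-attention score are treated as
    informative; ties are broken by original position.
    """
    order = sorted(range(len(cls_attention)), key=lambda i: cls_attention[i], reverse=True)
    return sorted(order[:keep]), sorted(order[keep:])


def slow_fast_update(
    tokens: Sequence[Token],
    cls_attention: Sequence[float],
    keep: int,
    slow_block: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Update informative tokens with `slow_block` and the rest with a cheap residual.

    No token is dropped, so the output has the same length and order as the input.
    """
    informative, uninformative = select_informative(cls_attention, keep)
    if not uninformative:
        return slow_block([list(t) for t in tokens])

    # Assumption: summarize the uninformative tokens into one representative
    # token, weighted by their class attention (uniform if all weights are zero).
    total = sum(cls_attention[i] for i in uninformative)
    weights = [
        cls_attention[i] / total if total > 0 else 1.0 / len(uninformative)
        for i in uninformative
    ]
    dim = len(tokens[0])
    rep = [sum(w * tokens[i][d] for w, i in zip(weights, uninformative)) for d in range(dim)]

    # Slow path: the expensive block sees only keep + 1 tokens.
    slow_out = slow_block([list(tokens[i]) for i in informative] + [rep])

    # Fast path (assumption): add the representative token's residual to every
    # uninformative token instead of running the block on each of them.
    residual = [o - r for o, r in zip(slow_out[-1], rep)]

    updated = [list(t) for t in tokens]
    for pos, i in enumerate(informative):
        updated[i] = list(slow_out[pos])
    for i in uninformative:
        updated[i] = [x + d for x, d in zip(tokens[i], residual)]
    return updated
```

Because every token is written back to its original position, the token grid stays complete, which is the property the abstract relies on for compatibility with structured spatial compression. Under these assumptions the expensive block processes keep + 1 tokens instead of all N, so the quadratic attention term shrinks from N^2 to (keep + 1)^2 token pairs.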

updated: Thu Sep 09 2021 13:24:45 GMT+0000 (UTC)

published: Tue Aug 03 2021 09:56:07 GMT+0000 (UTC)

arXiv
