ATS: Adaptive Token Sampling For Efficient Vision Transformers

Mohsen Fayyaz; Soroush Abbasi Kouhpayegani; Farnoush Rezaei Jafari; Eric Sommerlade; Hamid Reza Vaezi Joze; Hamed Pirsiavash; Juergen Gall

While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, no single setting is optimal for all input images. In this work, we therefore introduce a differentiable, parameter-free Adaptive Token Sampling (ATS) module that can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer static but varies for each input image. By integrating ATS as an additional layer within the current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pretrained vision transformers as a plug-and-play module, reducing their GFLOPs without any additional training. Because the design is differentiable, a vision transformer equipped with ATS can also be trained. We evaluate our module on the ImageNet dataset by adding it to multiple state-of-the-art vision transformers. Our evaluations show that the proposed module improves the state of the art by reducing the computational cost (GFLOPs) by 37% while preserving accuracy.
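To make "scoring and adaptively sampling significant tokens" concrete, here is a minimal sketch of one way a parameter-free sampler could produce an input-dependent token count. The abstract does not specify the mechanism, so everything below is an assumption for illustration: the function name `adaptive_token_sample`, the scoring rule (attention from the classification token weighted by value-vector norm), and the use of inverse transform sampling at evenly spaced quantiles are all hypothetical choices, not the authors' stated implementation.

```python
import bisect
from typing import List


def adaptive_token_sample(
    cls_attention: List[float],
    value_norms: List[float],
    max_tokens: int,
) -> List[int]:
    """Return the indices of tokens kept by a score-based sampler.

    cls_attention[i]: assumed attention weight from the classification
                      token to token i (hypothetical input).
    value_norms[i]:   assumed norm of token i's value vector (hypothetical).
    max_tokens:       upper bound on the number of tokens kept.
    """
    if len(cls_attention) != len(value_norms):
        raise ValueError("cls_attention and value_norms must have equal length")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    # Assumed significance score: attention weighted by value magnitude,
    # normalized so that the scores sum to 1. No learned parameters involved.
    raw = [a * v for a, v in zip(cls_attention, value_norms)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    scores = [r / total for r in raw]

    # Cumulative distribution of the scores over the token positions.
    cdf = []
    running = 0.0
    for s in scores:
        running += s
        cdf.append(running)

    # Inverse transform sampling at fixed, evenly spaced quantiles.
    # Several quantiles can land on the same high-score token; duplicates
    # are dropped, so the number of kept tokens depends on the input.
    kept: List[int] = []
    for k in range(max_tokens):
        u = (k + 0.5) / max_tokens
        idx = min(bisect.bisect_left(cdf, u), len(cdf) - 1)
        if not kept or kept[-1] != idx:  # idx is non-decreasing in k
            kept.append(idx)
    return kept


# Evenly spread scores keep every token; a single dominant token absorbs
# all quantiles, so only that token survives.
print(adaptive_token_sample([0.25, 0.25, 0.25, 0.25], [1.0] * 4, 4))  # [0, 1, 2, 3]
print(adaptive_token_sample([0.97, 0.01, 0.01, 0.01], [1.0] * 4, 4))  # [0]
```

Under these assumptions, `max_tokens` only caps the count: an image whose scores concentrate on a few tokens passes fewer tokens to later blocks, which is where the compute saving would come from. The sketch uses plain Python lists and is not differentiable; it only illustrates the selection logic.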

updated: Tue Nov 30 2021 18:56:57 GMT+0000 (UTC)

published: Tue Nov 30 2021 18:56:57 GMT+0000 (UTC)

arXiv
