Adaptive Token Sampling For Efficient Vision Transformers

Mohsen Fayyaz; Soroush Abbasi Koohpayegani; Farnoush Rezaei Jafari; Sunando Sengupta; Hamid Reza Vaezi Joze; Eric Sommerlade; Hamed Pirsiavash; Juergen Gall

While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable, parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer constant and varies for each input image. By integrating ATS as an additional layer within the current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pre-trained vision transformers as a plug-and-play module, thus reducing their GFLOPs without any additional training. Moreover, due to its differentiable design, one can also train a vision transformer equipped with ATS. We evaluate the efficiency of our module in both image and video classification tasks by adding it to multiple SOTA vision transformers. Our proposed module improves on these SOTA models by reducing their computational cost (GFLOPs) by 2×, while preserving their accuracy on the ImageNet, Kinetics-400, and Kinetics-600 datasets.
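The abstract does not spell out how tokens are scored or sampled, so the following is only a minimal sketch of how an input-dependent token count could arise. It assumes the significance scores are already given, and it uses inverse transform sampling at evenly spaced quantiles as one plausible selection rule. The function name `adaptive_sample` and the `max_tokens` parameter are hypothetical and are not taken from the abstract.

```python
from bisect import bisect_left
from itertools import accumulate


def adaptive_sample(scores, max_tokens):
    """Return sorted indices of the tokens kept by quantile-based sampling.

    scores: non-negative significance scores, one per token (assumed given;
        the abstract does not say how they are computed).
    max_tokens: upper bound on the number of kept tokens (hypothetical knob).
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")

    # Normalize the scores and build a cumulative distribution over tokens.
    cdf = list(accumulate(s / total for s in scores))

    kept = set()
    for k in range(max_tokens):
        # Evenly spaced quantile strictly inside (0, 1).
        u = (k + 0.5) / max_tokens
        # Index of the first token whose cumulative mass reaches u.
        idx = bisect_left(cdf, u)
        # Guard against floating-point round-off at the upper end.
        kept.add(min(idx, len(cdf) - 1))

    # Several quantiles can land on the same token, so the number of
    # distinct indices depends on how peaked the score distribution is.
    return sorted(kept)
```

Under these assumptions, uniform scores such as `[1.0, 1.0, 1.0, 1.0]` with `max_tokens=4` keep all four tokens, while a peaked input such as `[0.0, 10.0, 0.0, 0.0]` keeps only token 1. This illustrates how the token count can vary per input without any learned parameters.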

updated: Tue Jul 26 2022 17:54:59 GMT+0000 (UTC)

published: Tue Nov 30 2021 18:56:57 GMT+0000 (UTC)

arXiv
