Token Pooling in Vision Transformers

Dmitrii Marin; Jen-Hao Rick Chang; Anurag Ranjan; Anish Prabhu; Mohammad Rastegari; Oncel Tuzel

Despite their recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, self-attention is not the major computational bottleneck in most vision transformers: more than 80% of the computation is spent on fully-connected layers. To improve the computational complexity of all layers, we propose a novel token downsampling method, called Token Pooling, which efficiently exploits redundancies in the images and intermediate token representations. We show that, under mild assumptions, softmax-attention acts as a high-dimensional low-pass (smoothing) filter. Thus, its output contains redundancy that can be pruned to achieve a better trade-off between computational cost and accuracy. Our new technique accurately approximates a set of tokens by minimizing the reconstruction error caused by downsampling. We solve this optimization problem via cost-efficient clustering. We rigorously analyze prior downsampling methods and compare against them. Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods. Token Pooling is a simple and effective operator that can benefit many architectures. Applied to DeiT, it achieves the same ImageNet top-1 accuracy using 42% fewer computations.
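
The idea of approximating a set of tokens by a smaller set that minimizes reconstruction error can be illustrated with a minimal sketch. The abstract does not specify which clustering algorithm is used, so the code below assumes plain K-means in NumPy; the function name pool_tokens and its parameters (num_clusters, num_iters, seed) are hypothetical and are not taken from the paper.

```python
import numpy as np


def pool_tokens(tokens, num_clusters, num_iters=10, seed=0):
    """Approximate N tokens by K cluster centers using plain K-means.

    Illustrative sketch only (assumption): the paper's actual clustering
    procedure is not described in the abstract.

    tokens: float array of shape (N, D).
    Returns (centers, assign, error): centers has shape (K, D), assign has
    shape (N,), and error is the mean squared distance from each token to
    its assigned center, i.e. the reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    # Initialize the centers from K distinct, randomly chosen tokens.
    centers = tokens[rng.choice(n, size=num_clusters, replace=False)].copy()

    for _ in range(num_iters):
        # Squared Euclidean distances between all tokens and centers: (N, K).
        d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # Final assignment and reconstruction error for the returned centers.
    d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    error = d2[np.arange(n), assign].mean()
    return centers, assign, error


if __name__ == "__main__":
    # Example: reduce 196 synthetic 64-dimensional tokens to 49.
    x = np.random.default_rng(1).standard_normal((196, 64))
    pooled, assign, err = pool_tokens(x, num_clusters=49)
    print(pooled.shape, float(err))
```

In this sketch the pooled centers would replace the original tokens in later layers, so every subsequent layer, including the fully-connected ones, processes K tokens instead of N. The token counts used in the example are arbitrary.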

updated: Mon Oct 11 2021 15:17:21 GMT+0000 (UTC)

published: Fri Oct 08 2021 02:22:50 GMT+0000 (UTC)

arXiv
