Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Xiangcheng Liu; Tianyi Wu; Guodong Guo

Vision transformers have emerged as a new paradigm in computer vision, showing excellent performance at the expense of high computational cost. Image token pruning is one of the main approaches to ViT compression, because the complexity is quadratic in the number of tokens and many tokens that contain only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens or apply a fixed-ratio pruning strategy to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Specifically, we first propose an inexpensive class attention scoring mechanism weighted by attention head importance. Then, learnable parameters are inserted as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores with the thresholds, we can discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized in budget-aware training to balance accuracy and complexity, yielding pruning configurations that adapt to different input instances. Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
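The scoring-and-threshold step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how head importance is computed or how the thresholds are trained, so the sketch takes both as plain inputs. The function names, the normalization by the sum of head weights, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def class_attention_scores(
    attn: Sequence[Sequence[Sequence[float]]],
    head_weights: Sequence[float],
) -> List[float]:
    """Score each image token by head-weighted class-token attention.

    attn[h][i][j] is the attention from token i to token j in head h.
    Index 0 is assumed to be the class token. head_weights[h] is a
    hypothetical per-head importance weight supplied by the caller.
    """
    num_tokens = len(attn[0][0])
    total_weight = sum(head_weights)
    scores = []
    for j in range(1, num_tokens):  # skip the class token itself
        weighted = sum(w * head[0][j] for w, head in zip(head_weights, attn))
        scores.append(weighted / total_weight)  # assumed normalization
    return scores


def prune_tokens(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of image tokens whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def attention_cost_ratio(kept_tokens: int, total_tokens: int) -> float:
    """Relative cost of the quadratic self-attention term after pruning."""
    return (kept_tokens / total_tokens) ** 2


# Toy input: 2 heads, 1 class token + 4 image tokens.
# Only the class-token row (row 0) of each head is needed here.
attn = [
    [[0.2, 0.4, 0.1, 0.2, 0.1]],  # head 0
    [[0.2, 0.1, 0.1, 0.5, 0.1]],  # head 1
]
head_weights = [3.0, 1.0]  # assumed: head 0 matters three times as much
threshold = 0.2            # stands in for a learned threshold

scores = class_attention_scores(attn, head_weights)
kept = prune_tokens(scores, threshold)  # kept == [0, 2]
ratio = attention_cost_ratio(len(kept) + 1, len(scores) + 1)  # +1 for class token
print(kept, round(ratio, 2))
```

Because the threshold is compared against per-input scores, the number of surviving tokens varies from image to image, unlike a fixed-ratio strategy. The cost ratio reflects only the quadratic attention term mentioned in the abstract, not the full model.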

updated: Thu Jul 06 2023 10:49:33 GMT+0000 (UTC)

published: Wed Sep 28 2022 03:07:32 GMT+0000 (UTC)

arXiv

