Making Vision Transformers Efficient from A Token Sparsification View

Shuning Chang; Pichao Wang; Ming Lin; Fan Wang; David Junhao Zhang; Rong Jin; Mike Zheng Shou

The computational complexity of Vision Transformers (ViTs) is quadratic in the number of tokens, which limits their practical application. Several works propose pruning redundant tokens to obtain efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application to local vision transformers, and (iii) networks that are not general-purpose for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers; they are initialized by pooling image tokens in space and recovered by attention, so they can adaptively represent global or local semantic information. Owing to this clustering property, a few semantic tokens can attain the same effect as a vast number of image tokens, for both global and local vision transformers. For instance, with only 16 semantic tokens, DeiT-(Tiny,Small,Base) achieves the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), employing 16 semantic tokens in each window yields a further speedup of around 20% with a slight accuracy increase. Beyond image classification, we also extend our method to video recognition. In addition, we design an STViT-R(ecover) network that restores detailed spatial information based on STViT, making it work for downstream tasks, where previous token sparsification methods are ineffective. Experiments demonstrate that our method achieves competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
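The two steps the abstract names for semantic tokens, initialization by spatial pooling and recovery by attention, can be illustrated with a toy sketch. This is not the authors' implementation: the grid size, token dimension, function names, and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration.

```python
import math
import random

def avg_pool_tokens(tokens, grid, out):
    """Average-pool a (grid x grid) token map into (out x out) tokens.

    Assumes grid is divisible by out; tokens are stored row-major.
    """
    step = grid // out
    dim = len(tokens[0])
    pooled = []
    for by in range(out):
        for bx in range(out):
            block = [tokens[(by * step + y) * grid + (bx * step + x)]
                     for y in range(step) for x in range(step)]
            pooled.append([sum(t[k] for t in block) / len(block)
                           for k in range(dim)])
    return pooled

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Single-head attention with identity projections (an assumption;
    a real model would use learned Q/K/V weights)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[k] for w, v in zip(weights, keys_values))
                    for k in range(dim)])
    return out

random.seed(0)
GRID, OUT, DIM = 8, 4, 8  # toy sizes, not taken from the paper
image_tokens = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
                for _ in range(GRID * GRID)]

# Step 1: initialize 16 semantic tokens by pooling image tokens in space.
semantic_tokens = avg_pool_tokens(image_tokens, GRID, OUT)
# Step 2: refine them by attending over all image tokens.
semantic_tokens = cross_attention(semantic_tokens, image_tokens)

# Token-pair count for self-attention over image vs. semantic tokens.
pair_ratio = (GRID * GRID) ** 2 / (OUT * OUT) ** 2
```

In this toy setting, later layers would operate on 16 semantic tokens instead of 64 image tokens, so the number of token pairs in self-attention shrinks by `pair_ratio`. The end-to-end FLOPs and speed figures quoted in the abstract also depend on parts of the network this sketch does not model.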

updated: Wed Mar 15 2023 15:12:36 GMT+0000 (UTC)

published: Wed Mar 15 2023 15:12:36 GMT+0000 (UTC)

arXiv
