PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

Boyu Chen; Peixia Li; Baopu Li; Chuming Li; Lei Bai; Chen Lin; Ming Sun; Junjie Yan; Wanli Ouyang

In this paper, we observe two levels of redundancy when applying vision transformers (ViT) to image recognition. First, fixing the number of tokens throughout the whole network produces redundant features at the spatial level. Second, the attention maps of different transformer layers are redundant. Based on these observations, we propose PSViT: a ViT with token Pooling and attention Sharing that reduces this redundancy, effectively enhances the feature representation ability, and achieves a better speed-accuracy trade-off. Specifically, in PSViT, token pooling is defined as the operation that decreases the number of tokens at the spatial level. In addition, attention sharing is built between neighboring transformer layers to reuse attention maps that are strongly correlated across adjacent layers. Then, a compact set of the possible combinations of token pooling and attention sharing mechanisms is constructed. Based on this compact set, the number of tokens in each layer and the choice of layers that share attention can be treated as hyper-parameters that are learned from data automatically. Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification compared with DeiT.
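The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the pooling operator, so average pooling over adjacent tokens in a flattened sequence is assumed here, the class token and multi-head structure are omitted, and all function names (`attention_map`, `token_pool`, `layer`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    # Single-head attention weights for tokens x of shape (n, d).
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def token_pool(x, stride=2):
    # Assumed pooling operator: average each group of `stride` adjacent
    # tokens, reducing n tokens to n // stride.
    n, d = x.shape
    n_keep = (n // stride) * stride
    return x[:n_keep].reshape(n // stride, stride, d).mean(axis=1)

def layer(x, w_v, attn):
    # Apply a given attention map to the value projection, plus a residual.
    return x + attn @ (x @ w_v)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((16, d))  # 16 tokens of width 8
w_q, w_k, w_v1, w_v2 = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

attn = attention_map(x, w_q, w_k)  # computed once, in the first layer
h1 = layer(x, w_v1, attn)
h2 = layer(h1, w_v2, attn)         # second layer reuses the map (attention sharing)
pooled = token_pool(h2, stride=2)  # 16 tokens reduced to 8 (token pooling)
```

In this sketch the second layer skips the query/key projections and the softmax entirely, which is where the savings from sharing would come from, and every layer after the pooling step operates on half as many tokens.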

updated: Sat Aug 07 2021 11:30:54 GMT+0000 (UTC)

published: Sat Aug 07 2021 11:30:54 GMT+0000 (UTC)

arXiv
