Shunted Self-Attention via Multi-Scale Token Aggregation

Sucheng Ren; Daquan Zhou; Shengfeng He; Jiashi Feng; Xinchao Wang

Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their ability to model long-range dependencies among image patches or tokens via self-attention. These models, however, usually assign similar receptive fields to every token feature within a layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby degrading performance on images that contain multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model attention at hybrid scales within each attention layer. The key idea of SSA is to inject heterogeneous receptive field sizes into tokens: before computing the self-attention matrix, it selectively merges tokens to represent larger object features while keeping certain tokens to preserve fine-grained features. This novel merging scheme enables the self-attention to learn relationships between objects of different sizes while simultaneously reducing the number of tokens and the computational cost. Extensive experiments across various tasks demonstrate the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet with only half of the model size and computation cost, and surpasses Focal Transformer by 1.3 mAP on COCO and 2.9 mIOU on ADE20K under similar parameter and computation cost. Code has been released at https://github.com/OliverRensu/Shunted-Transformer.
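The merging idea described in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation: it assumes tokens lie on an h x w grid, uses average pooling as a stand-in for whatever learned aggregation the authors use, and draws random projection weights in place of trained ones. The names `merge_tokens`, `shunted_attention`, and `rates` are hypothetical. Each head keeps all queries but attends to keys and values merged at its own rate, so a head with rate 1 preserves fine-grained tokens while a head with a larger rate sees coarser ones.

```python
import numpy as np

def merge_tokens(x, h, w, r):
    """Average-pool an (h*w, c) token grid by a factor r along each spatial axis.

    Assumption: average pooling stands in for a learned merging layer.
    Requires h and w to be divisible by r.
    """
    c = x.shape[-1]
    grid = x.reshape(h // r, r, w // r, r, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shunted_attention(x, h, w, rates, rng):
    """One attention layer with one head per entry of `rates` (illustrative only)."""
    n, c = x.shape
    d = c // len(rates)  # per-head channel width
    outputs = []
    for r in rates:
        # Hypothetical random projections; a trained model would learn these.
        wq, wk, wv = (rng.standard_normal((c, d)) / np.sqrt(c) for _ in range(3))
        q = x @ wq                          # (n, d): every token keeps its own query
        merged = merge_tokens(x, h, w, r)   # (n / r**2, c): coarser keys and values
        k, v = merged @ wk, merged @ wv
        attn = softmax(q @ k.T / np.sqrt(d))  # (n, n / r**2)
        outputs.append(attn @ v)
    return np.concatenate(outputs, axis=-1)   # (n, d * len(rates))

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8 * 8, 8))      # an 8 x 8 grid of 8-dimensional tokens
out = shunted_attention(tokens, 8, 8, rates=(1, 2), rng=rng)
print(out.shape)
```

In this sketch the head with rate 2 builds a 64 x 16 attention matrix instead of 64 x 64, which is where the reduction in token count and computation would come from under these assumptions.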

updated: Tue Nov 30 2021 08:08:47 GMT+0000 (UTC)

published: Tue Nov 30 2021 08:08:47 GMT+0000 (UTC)

arXiv
