Local-to-Global Self-Attention in Vision Transformers

Jinpeng Li; Yichao Yan; Shengcai Liao; Xiaokang Yang; Ling Shao

Transformers have demonstrated great potential in computer vision tasks. To avoid dense self-attention computation over high-resolution visual data, some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows. This design significantly improves efficiency but lacks global feature reasoning in the early stages. In this work, we design a multi-path Transformer structure that enables local-to-global reasoning at multiple granularities in each stage. The proposed framework is computationally efficient and highly effective. With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation. Code is available at https://github.com/ljpadam/LG-Transformer
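
The efficiency argument above can be made concrete with a rough count of query-key pairs. The sketch below is not taken from the authors' code: the function names, the 56x56 feature map, the 7x7 window, and the idea that the extra paths operate on feature maps downsampled by factors of 2 and 4 are all illustrative assumptions. It only counts attention pairs and ignores projections, downsampling, and upsampling costs.

```python
def attention_pairs_global(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def attention_pairs_windowed(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping
    window x window blocks. Assumes the map divides evenly into windows."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


def attention_pairs_multipath(height: int, width: int, window: int,
                              downsample_factors: tuple) -> int:
    """Total pairs over parallel paths (hypothetical layout): each path
    downsamples the map by its factor, then applies windowed attention.
    A factor of 1 is the ordinary full-resolution local path."""
    total = 0
    for factor in downsample_factors:
        if height % factor or width % factor:
            raise ValueError("feature map must be divisible by the factor")
        total += attention_pairs_windowed(height // factor,
                                          width // factor, window)
    return total


# Assumed example sizes, not values reported in the abstract.
dense = attention_pairs_global(56, 56)                       # 9834496
local = attention_pairs_windowed(56, 56, 7)                  # 153664
multi = attention_pairs_multipath(56, 56, 7, (1, 2, 4))      # 201684
overhead = multi / local                                     # 1.3125
```

Under these assumptions, a fixed window on a map downsampled by a factor of 4 spans a 28x28 region of the original map, so the coarse paths see progressively more global context while adding about 31% to the attention-pair count of the purely local design. The dense alternative would cost 64 times as much as the local one.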

updated: Sat Jul 10 2021 02:34:55 GMT+0000 (UTC)

published: Sat Jul 10 2021 02:34:55 GMT+0000 (UTC)

arXiv
