RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen; Rameswar Panda; Quanfu Fan

Vision transformers (ViTs) have recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, the vanilla ViT simply inherits its architecture directly from natural language processing, and that architecture is often not optimized for vision applications. Motivated by this, we propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention instead of the global self-attention used in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image using different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention consists of two steps: first, the regional self-attention extracts global information among all regional tokens; then, the local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on four vision tasks (image classification, object and keypoint detection, semantic segmentation, and action recognition) show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models are available at https://github.com/ibm/regionvit.
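The two-step attention described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`self_attention`, `local_indices`, `regional_to_local_attention`), the square token grids, and the `ratio` parameter (how many local tokens one regional token spans per side) are assumptions made for illustration. The attention uses a single head with identity query/key/value projections, and it omits residual connections, normalization, and feed-forward layers. Writing the updated regional token back in step 2 is also an assumption.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Simplification (assumption): a real model uses learned projections.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def local_indices(region_row, region_col, ratio, local_grid):
    """Flat indices of the local tokens covered by one regional token."""
    return [
        (region_row * ratio + dr) * local_grid + (region_col * ratio + dc)
        for dr in range(ratio)
        for dc in range(ratio)
    ]

def regional_to_local_attention(regional, local, region_grid, ratio):
    """Hypothetical sketch of the two steps named in the abstract."""
    local_grid = region_grid * ratio
    # Step 1: regional self-attention mixes information among all regional tokens.
    regional = self_attention(regional)
    # Step 2: local self-attention inside each region, over one regional token
    # plus its associated local tokens.
    new_regional = list(regional)
    new_local = list(local)
    for r in range(region_grid):
        for c in range(region_grid):
            ridx = r * region_grid + c
            lidx = local_indices(r, c, ratio, local_grid)
            group = [regional[ridx]] + [local[i] for i in lidx]
            updated = self_attention(group)
            new_regional[ridx] = updated[0]  # assumption: regional token is updated too
            for i, tok in zip(lidx, updated[1:]):
                new_local[i] = tok
    return new_regional, new_local

# Toy usage: a 2x2 grid of regional tokens, each spanning 2x2 local tokens.
regional_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
local_tokens = [[0.1 * i, 0.05 * i] for i in range(16)]
out_regional, out_local = regional_to_local_attention(
    regional_tokens, local_tokens, region_grid=2, ratio=2
)
```

In this sketch a local token only attends within its own region, yet it still depends on every regional token, because the regional token it attends to has already been mixed with all the others in step 1.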

updated: Thu Mar 31 2022 03:20:15 GMT+0000 (UTC)

published: Fri Jun 04 2021 19:57:11 GMT+0000 (UTC)

arXiv
