RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen; Rameswar Panda; Quanfu Fan

Vision transformers (ViTs) have recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, we propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention includes two steps: first, regional self-attention extracts global information among all regional tokens; then, local self-attention exchanges information between one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on three vision tasks (image classification, object detection, and action recognition) show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source code and models will be made publicly available.
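The two-step attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention with identity query/key/value projections, and it omits layer normalization, residual connections, feed-forward layers, and the pyramid structure. The names `self_attention`, `regional_to_local_attention`, and `window` are hypothetical and chosen only for this example; `window` is the number of local tokens per region along each spatial axis.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Simplified single-head self-attention over the second-to-last axis.
    # Assumption: Q, K, and V are the tokens themselves (no learned weights).
    scale = tokens.shape[-1] ** -0.5
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale
    return softmax(scores, axis=-1) @ tokens


def regional_to_local_attention(regional, local, window):
    # regional: (Hr, Wr, C) grid of regional tokens.
    # local:    (Hr * window, Wr * window, C) grid of local tokens.
    Hr, Wr, C = regional.shape
    assert local.shape == (Hr * window, Wr * window, C)

    # Step 1: regional self-attention among all regional tokens.
    reg = self_attention(regional.reshape(Hr * Wr, C))  # (R, C)

    # Group local tokens by the region they fall into: (R, window*window, C).
    loc = local.reshape(Hr, window, Wr, window, C)
    loc = loc.transpose(0, 2, 1, 3, 4).reshape(Hr * Wr, window * window, C)

    # Step 2: local self-attention over each regional token together with
    # its associated local tokens.
    group = np.concatenate([reg[:, None, :], loc], axis=1)
    group = self_attention(group)  # (R, 1 + window*window, C)

    # Split back into the regional and local token grids.
    new_regional = group[:, 0, :].reshape(Hr, Wr, C)
    new_local = group[:, 1:, :].reshape(Hr, Wr, window, window, C)
    new_local = new_local.transpose(0, 2, 1, 3, 4)
    new_local = new_local.reshape(Hr * window, Wr * window, C)
    return new_regional, new_local
```

In this sketch, the local tokens of a region never attend directly to tokens outside that region. Information from elsewhere in the image reaches them only through the region's regional token, which has already attended to every other regional token in step 1.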

updated: Fri Jun 04 2021 19:57:11 GMT+0000 (UTC)

published: Fri Jun 04 2021 19:57:11 GMT+0000 (UTC)

arXiv
