Vicinity Vision Transformer

Weixuan Sun; Zhen Qin; Hui Deng; Jianyuan Wang; Yi Zhang; Kaihao Zhang; Nick Barnes; Stan Birchfield; Lingpeng Kong; Yiran Zhong

Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic in the number of tokens. Although linear attention was introduced in natural language processing (NLP) tasks to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information than NLP tasks do. Based on this observation, we present Vicinity Attention, which introduces a locality bias to vision transformers while keeping linear complexity. Specifically, for each image patch, we adjust its attention weights based on its 2D Manhattan distance to the other patches, so that neighbouring patches receive stronger attention than far-away patches. Moreover, since Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure that reduces the feature dimension without degrading accuracy. We perform extensive experiments on the CIFAR100, ImageNet1K, and ADE20K datasets to validate the effectiveness of our method. As the input resolution increases, the GFLOPs of our method grow more slowly than those of previous transformer-based and convolution-based networks. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
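The two ideas in the abstract, a locality bias driven by 2D Manhattan distance and the requirement that the token length exceed the feature dimension, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear decay shape, and the rough cost formulas (about N²·d for softmax attention, N·d² for linear attention) are assumptions made for illustration. The sketch also builds the full N×N weight matrix, so it is quadratic and shows only the effect of the bias, not the linear-complexity formulation the paper proposes.

```python
def manhattan_distance(i, j, width):
    """2D Manhattan distance between patches i and j on a row-major grid."""
    ri, ci = divmod(i, width)
    rj, cj = divmod(j, width)
    return abs(ri - rj) + abs(ci - cj)


def locality_weights(height, width):
    """Hypothetical locality bias: 1.0 for the patch itself, decaying
    linearly with Manhattan distance (the decay shape is an assumption)."""
    n = height * width
    max_dist = max((height - 1) + (width - 1), 1)
    return [
        [1.0 - manhattan_distance(i, j, width) / (max_dist + 1) for j in range(n)]
        for i in range(n)
    ]


def reweight_attention(scores, weights):
    """Scale non-negative attention scores by the bias, then renormalise rows."""
    out = []
    for s_row, w_row in zip(scores, weights):
        biased = [s * w for s, w in zip(s_row, w_row)]
        total = sum(biased)
        out.append([b / total for b in biased])
    return out


def attention_cost(n_tokens, dim):
    """Rough operation counts (assumed formulas, constants ignored)."""
    return {"softmax": n_tokens ** 2 * dim, "linear": n_tokens * dim ** 2}


if __name__ == "__main__":
    h, w = 3, 3
    uniform = [[1.0] * (h * w) for _ in range(h * w)]
    attn = reweight_attention(uniform, locality_weights(h, w))
    # Patch 0 attends more to its neighbour (patch 1) than to the far corner (patch 8).
    print(attn[0][1] > attn[0][8])
    # Linear attention only pays off when the token count exceeds the feature dimension.
    print(attention_cost(3136, 64))
    print(attention_cost(16, 64))
```

With uniform input scores, the reweighted row for the top-left patch gives more weight to its immediate neighbour than to the opposite corner. The cost comparison shows why the abstract ties efficiency to token length: under these rough counts, linear attention is cheaper for 3136 tokens at dimension 64 but more expensive for 16 tokens at the same dimension.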

updated: Tue Jun 21 2022 17:33:53 GMT+0000 (UTC)

published: Tue Jun 21 2022 17:33:53 GMT+0000 (UTC)

arXiv
