KVT: k-NN Attention for Boosting Vision Transformers

Pichao Wang; Xue Wang; Fan Wang; Ming Lin; Shuning Chang; Hao Li; Rong Jin

Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, and they show promising performance. A key component of vision transformers is fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., background clutter and occlusion), leading to a slow training process and potential degradation of performance. To address these problems, we propose k-NN attention for boosting vision transformers. Specifically, instead of involving all the tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, k-NN attention allows for the exploration of long-range correlations and at the same time filters out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is powerful in speeding up training and distilling noise from input tokens. Extensive experiments are conducted using 11 different vision transformer architectures to verify that the proposed k-NN attention can work with any existing transformer architecture to improve its prediction performance. The code is available at https://github.com/damo-cv/KVT.
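The top-k selection described in the abstract can be illustrated with a minimal single-head NumPy sketch. It is not taken from the authors' repository: the function name `knn_attention`, the scaled dot-product similarity, and the choice to mask non-selected scores with negative infinity before the softmax are all assumptions made for illustration.

```python
import numpy as np


def knn_attention(q, k, v, top_k):
    """Hypothetical single-head attention keeping only the top_k keys per query.

    q, k: arrays of shape (n_tokens, d); v: array of shape (n_tokens, d_v).
    Returns the attended values and the (sparse) attention weights.
    """
    d = q.shape[-1]
    # Scaled dot-product similarity between every query and every key (assumed).
    scores = q @ k.T / np.sqrt(d)
    # Column indices of the top_k largest scores in each row (unordered).
    keep = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    # Start from -inf everywhere, then copy back only the selected scores.
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, keep, np.take_along_axis(scores, keep, axis=-1), axis=-1)
    # Row-wise softmax; exp(-inf) = 0, so dropped keys receive zero weight.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 6 tokens with 4-dimensional embeddings, each query keeps 2 keys.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
out, w = knn_attention(q, k, v, top_k=2)
dense_out, dense_w = knn_attention(q, k, v, top_k=6)  # top_k = n_tokens: no key is dropped
```

In this sketch, setting `top_k` equal to the number of tokens reduces to ordinary dense softmax attention, which makes the sparse variant easy to compare against a dense baseline.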

updated: Fri Jul 22 2022 23:18:16 GMT+0000 (UTC)

published: Fri May 28 2021 06:49:10 GMT+0000 (UTC)

arXiv

