CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention

Wenxiao Wang; Lu Yao; Long Chen; Binbin Lin; Deng Cai; Xiaofei He; Wei Liu

Transformers have made great progress on computer vision tasks. However, existing vision transformers still cannot build interactions among features of different scales, an ability that is perceptually important for visual inputs. The reasons are two-fold: (1) the input embeddings of each layer are equal-scale, so no cross-scale features can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, sacrificing the small-scale (fine-grained) features of the embeddings and disabling cross-scale interactions. To address this, we propose the Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. LSDA splits the self-attention module into a short-distance part and a long-distance part, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the embeddings. Together, these two designs achieve cross-scale attention. In addition, we put forward a dynamic position bias for vision transformers so that the popular relative position bias applies to variable-sized images. Building on the cross-scale attention module, we construct a versatile vision architecture, dubbed CrossFormer, which accommodates variable-sized inputs. Extensive experiments show that CrossFormer outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code has been released: https://github.com/cheerss/CrossFormer.
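The abstract does not spell out how LSDA splits attention. A minimal sketch of one plausible reading follows, assuming the short-distance branch groups adjacent embeddings into windows and the long-distance branch groups embeddings sampled at a fixed interval. The function names and the parameters `g` and `interval` are hypothetical illustrations, not taken from the released code.

```python
def short_distance_groups(size, g):
    """Group id for each (row, col) on a size x size grid of embeddings.

    Assumption: adjacent g x g windows attend only among themselves.
    """
    if size % g != 0:
        raise ValueError("size must be divisible by g")
    per_row = size // g  # number of windows along one side
    return [[(r // g) * per_row + (c // g) for c in range(size)]
            for r in range(size)]


def long_distance_groups(size, interval):
    """Group id for each (row, col) on a size x size grid of embeddings.

    Assumption: embeddings sampled at a fixed interval share a group,
    so each group spans the whole grid.
    """
    if size % interval != 0:
        raise ValueError("size must be divisible by interval")
    return [[(r % interval) * interval + (c % interval) for c in range(size)]
            for r in range(size)]


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    counts = {}
    for row in groups:
        for gid in row:
            counts[gid] = counts.get(gid, 0) + 1
    return sum(n * n for n in counts.values())


if __name__ == "__main__":
    size = 8
    sda = short_distance_groups(size, g=4)
    lda = long_distance_groups(size, interval=2)
    full = (size * size) ** 2  # every embedding attends to every other
    print("full attention pairs:", full)
    print("short-distance pairs:", attention_pairs(sda))
    print("long-distance pairs:", attention_pairs(lda))
```

Under these assumptions, each branch on an 8 x 8 grid scores a quarter of the query-key pairs that full attention would, while the long-distance groups still connect embeddings from opposite sides of the grid.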

updated: Fri Oct 08 2021 06:56:25 GMT+0000 (UTC)

published: Sat Jul 31 2021 05:52:21 GMT+0000 (UTC)

arXiv

