CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Wenxiao Wang; Wei Chen; Qibo Qiu; Long Chen; Boxi Wu; Binbin Lin; Xiaofei He; Wei Liu

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance part and a long-distance part, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe two further issues that affect vision transformers' performance: enlarging self-attention maps and amplitude explosion. We therefore propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate these two issues, respectively. CrossFormer incorporating PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code will be available at: https://github.com/cheerss/CrossFormer.
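To make the LSDA idea concrete, here is a minimal sketch of how a token grid might be split into attention groups. The abstract does not spell out the grouping rule, so the sketch assumes adjacent windows for the short-distance part and fixed-stride sampling for the long-distance part. The function names and the `size`, `group`, and `interval` parameters are hypothetical and are not taken from the CrossFormer repository.

```python
def short_distance_groups(size, group):
    """Partition a size x size token grid into adjacent group x group windows.

    Assumption: short-distance attention runs inside each window.
    """
    groups = {}
    for r in range(size):
        for c in range(size):
            groups.setdefault((r // group, c // group), []).append((r, c))
    return list(groups.values())


def long_distance_groups(size, interval):
    """Partition the same grid by sampling tokens at a fixed stride.

    Assumption: long-distance attention runs among tokens that share
    the same offset modulo `interval`.
    """
    groups = {}
    for r in range(size):
        for c in range(size):
            groups.setdefault((r % interval, c % interval), []).append((r, c))
    return list(groups.values())


def pairwise_scores(groups):
    """Count the attention scores computed when attention stays inside groups."""
    return sum(len(g) ** 2 for g in groups)


sda = short_distance_groups(4, 2)
lda = long_distance_groups(4, 2)
print(sda[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(lda[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]

full = (4 * 4) ** 2            # every token attends to every token
grouped = pairwise_scores(sda)  # attention restricted to each group
print(full, grouped)            # 256 64
```

On this 4 x 4 toy grid, each scheme computes 64 pairwise scores instead of 256, which illustrates the reduced computational burden. Each short-distance group covers a local neighbourhood, while each long-distance group spans the whole grid.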

updated: Mon Mar 13 2023 07:54:29 GMT+0000 (UTC)

published: Mon Mar 13 2023 07:54:29 GMT+0000 (UTC)

arXiv

