Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

Huangjie Zheng; Pengcheng He; Weizhu Chen; Mingyuan Zhou

Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks with a simple architecture and relatively small computational cost. Their computational efficiency is mainly attributed to avoiding self-attention, which is often computationally heavy, yet this comes at the expense of not being able to mix tokens both globally and locally. In this paper, to exploit both global and local dependencies without self-attention, we present Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with the amount of spatial shifting. In addition to conventional mixing and shifting techniques, MS-MLP mixes both neighboring and distant tokens from fine- to coarse-grained levels and then gathers them via a shifting operation. This directly contributes to the interactions between global and local tokens. Being simple to implement, MS-MLP achieves competitive performance on multiple vision benchmarks. For example, an MS-MLP with 85 million parameters achieves 83.8% top-1 classification accuracy on ImageNet-1K. Moreover, by combining MS-MLP with state-of-the-art Vision Transformers such as the Swin Transformer, we show that MS-MLP achieves further improvements at three different model scales, e.g., by 0.5% on ImageNet-1K classification with Swin-B. The code is available at: https://github.com/JegZheng/MS-MLP.
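The mix-then-shift idea described in the abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation (see the repository linked above for that). It assumes one channel per shift group, a uniform local average as a stand-in for a learned mixing operation, zero padding at the borders, and a window size of 2|s| + 1 for a shift of s positions. The function names `local_mean`, `shift`, and `mix_shift_1d` are hypothetical and chosen only for illustration.

```python
def local_mean(seq, k):
    """Average each position over a centred window of odd size k (zero padding).

    Assumption: a uniform average stands in for a learned local mixing step.
    """
    n = len(seq)
    r = k // 2
    out = []
    for i in range(n):
        total = 0.0
        for j in range(i - r, i + r + 1):
            if 0 <= j < n:
                total += seq[j]
        out.append(total / k)
    return out


def shift(seq, s):
    """Move values by s positions (positive s moves towards higher indices).

    Positions with no source value are filled with zeros.
    """
    n = len(seq)
    return [seq[i - s] if 0 <= i - s < n else 0.0 for i in range(n)]


def mix_shift_1d(tokens, shifts):
    """Toy mix-shift over a 1-D token sequence.

    tokens: list of N tokens, each a list of C channel values.
    shifts: list of C integer offsets, one per channel (hypothetical grouping).
    The mixing window grows with the shift distance: k = 2 * |s| + 1 (assumed rule).
    """
    n = len(tokens)
    out = [[0.0] * len(shifts) for _ in range(n)]
    for c, s in enumerate(shifts):
        channel = [tok[c] for tok in tokens]
        mixed = local_mean(channel, 2 * abs(s) + 1)  # wider window for larger shift
        moved = shift(mixed, s)                      # gather at the shifted position
        for i in range(n):
            out[i][c] = moved[i]
    return out


# Usage: a single "active" token in the middle of five tokens, three channels.
tokens = [[0.0] * 3, [0.0] * 3, [1.0] * 3, [0.0] * 3, [0.0] * 3]
result = mix_shift_1d(tokens, shifts=[-1, 0, 1])
```

In this sketch the unshifted channel keeps a window of size 1 and so leaves the token unchanged, while the shifted channels first blend a wider neighbourhood and then move the blended values, so a subsequent per-token channel MLP would see both local and more distant context at each position.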

updated: Mon Feb 14 2022 06:53:48 GMT+0000 (UTC)

published: Mon Feb 14 2022 06:53:48 GMT+0000 (UTC)

arXiv

