MetaFormer Is Actually What You Need for Vision

Weihao Yu; Mi Luo; Pan Zhou; Chenyang Si; Yichen Zhou; Xinchao Wang; Jiashi Feng; Shuicheng Yan

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of Transformers, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly simple spatial pooling operator that conducts only basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing the well-tuned Vision Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 50%/62% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from Transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent Transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. Code is available at https://github.com/sail-sg/poolformer.
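The idea in the abstract, a block whose token mixer is a pluggable component and can be as simple as spatial pooling, can be illustrated with a small pure-Python sketch. This is not the authors' implementation (see the linked repository for that). The names `avg_pool_mixer`, `metaformer_block`, and `channel_fn` are hypothetical, the 3x3 window and the edge handling are assumptions, and the sketch works on a single-channel token grid with the normalization layers left out for brevity.

```python
from typing import Callable, List

Grid = List[List[float]]  # one channel of an H x W token map (simplifying assumption)


def avg_pool_mixer(x: Grid, k: int = 3) -> Grid:
    """Mix tokens by averaging each one with its k x k neighbourhood.

    Stride is 1, so the grid size is preserved. At the borders only
    in-bounds neighbours are averaged (an assumption of this sketch).
    """
    h, w = len(x), len(x[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [
                x[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            ]
            out[i][j] = sum(vals) / len(vals)
    return out


def metaformer_block(
    x: Grid,
    token_mixer: Callable[[Grid], Grid],
    channel_fn: Callable[[float], float],
) -> Grid:
    """Residual token mixing, then a residual per-token transform.

    `token_mixer` is the swappable part: attention, a spatial MLP, or pooling.
    `channel_fn` stands in for the per-token MLP of a real block.
    Normalization is omitted here.
    """
    mixed = token_mixer(x)
    x = [[a + b for a, b in zip(row_x, row_m)] for row_x, row_m in zip(x, mixed)]
    return [[v + channel_fn(v) for v in row] for row in x]


if __name__ == "__main__":
    grid = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
    # The pooling mixer has no learnable parameters; it only spreads
    # information between neighbouring tokens.
    print(avg_pool_mixer(grid))
    print(metaformer_block(grid, avg_pool_mixer, lambda v: 0.5 * v))
```

Passing a different callable as `token_mixer` changes the mixing operation without touching the rest of the block, which is the sense in which the abstract describes MetaFormer as an architecture that leaves the token mixer unspecified.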

updated: Mon Jul 04 2022 17:59:58 GMT+0000 (UTC)

published: Mon Nov 22 2021 18:52:03 GMT+0000 (UTC)

arXiv
