Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Chuanxin Tang; Yucheng Zhao; Guangting Wang; Chong Luo; Wenxuan Xie; Wenjun Zeng

Transformers have rapidly gained ground in computer vision. In this work, we explore whether the core self-attention module in the Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies a 1D MLP along each axial direction, and the parameters are shared among rows or columns. Through sparse connections and weight sharing, the sMLP module significantly reduces the number of model parameters and the computational complexity, avoiding the over-fitting problem that commonly degrades the performance of MLP-like models. When trained only on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaled up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.
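The axial token mixing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single linear layer per axis, and the way the branches are fused (concatenation with the input followed by a pointwise projection `w_fuse`) are assumptions made for illustration, since the abstract only states that a 1D MLP is applied along each axis with weights shared among rows or columns. The last helper compares the number of token-mixing weights against a dense layer that connects every token to every other token.

```python
import numpy as np


def axial_token_mixing(x, w_h, w_w):
    """Mix tokens along each spatial axis of x with shared weights.

    x:   (H, W, C) grid of image tokens
    w_h: (H, H) weights mixing along the height axis,
         shared by every column and every channel
    w_w: (W, W) weights mixing along the width axis,
         shared by every row and every channel
    """
    mixed_h = np.einsum("ih,hwc->iwc", w_h, x)  # each token sees its column
    mixed_w = np.einsum("jw,hwc->hjc", w_w, x)  # each token sees its row
    return mixed_h, mixed_w


def smlp_sketch(x, w_h, w_w, w_fuse):
    """Hypothetical sMLP-style block (fusion scheme is an assumption).

    Concatenates the input with both axial branches along the channel
    axis and projects back to C channels with w_fuse of shape (3C, C).
    """
    mixed_h, mixed_w = axial_token_mixing(x, w_h, w_w)
    stacked = np.concatenate([x, mixed_h, mixed_w], axis=-1)  # (H, W, 3C)
    return stacked @ w_fuse  # (H, W, C)


def token_mixing_weights(h, w):
    """Return (dense, axial) weight counts for token mixing on an h x w grid."""
    dense = (h * w) ** 2   # one weight per pair of tokens
    axial = h * h + w * w  # one shared matrix per axis
    return dense, axial
```

For example, `token_mixing_weights(56, 56)` contrasts a dense mixing matrix over 3136 tokens with the two shared 56 x 56 matrices used here, which shows where the reduction in parameters from sparse connections and weight sharing comes from.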

updated: Sun Sep 12 2021 04:05:15 GMT+0000 (UTC)

published: Sun Sep 12 2021 04:05:15 GMT+0000 (UTC)

arXiv
