AS-MLP: An Axial Shifted MLP Architecture for Vision

Dongze Lian; Zehao Yu; Xing Sun; Shenghua Gao

This paper proposes an Axial Shifted MLP architecture (AS-MLP). Unlike MLP-Mixer, which encodes global spatial features for the information flow through matrix transposition and a single token-mixing MLP, we pay more attention to local feature communication. By axially shifting the channels of the feature map, AS-MLP obtains information flow from different axial directions, which captures local dependencies. This operation lets a pure MLP architecture achieve the same local receptive field as a CNN-like architecture. The receptive field size and dilation of AS-MLP blocks can also be designed in the same way as those of convolution kernels. With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset. This simple yet effective architecture outperforms all MLP-based architectures and achieves competitive performance compared to transformer-based architectures (e.g., Swin Transformer), even with slightly lower FLOPs. In addition, AS-MLP is the first MLP-based architecture to be applied to downstream tasks (e.g., object detection and semantic segmentation). Our proposed AS-MLP obtains 51.5 mAP on the COCO validation set and 49.5 MS mIoU on the ADE20K dataset, which is competitive with transformer-based architectures. Code is available at https://github.com/svip-lab/AS-MLP.
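The axial shift described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under stated assumptions: the feature map has layout (C, H, W), channels are split into `shift_size` groups, each group is shifted by a different offset along one spatial axis with zero padding, and a per-pixel channel projection then mixes the shifted channels. The names `axial_shift` and `as_mlp_mix`, the parameters `shift_size` and `dilation`, and the choice to sum the horizontal and vertical branches are all hypothetical and chosen for illustration.

```python
import numpy as np


def axial_shift(x, shift_size=3, axis=2, dilation=1):
    """Shift channel groups of a (C, H, W) feature map along one spatial axis.

    Assumption: channels are split into `shift_size` groups, and group g is
    moved by (g - shift_size // 2) * dilation positions, with zeros filling
    the positions that are shifted in from outside the map.
    """
    if shift_size % 2 == 0:
        raise ValueError("this sketch assumes an odd shift_size")
    out = np.zeros_like(x)
    n = x.shape[axis]
    groups = np.array_split(np.arange(x.shape[0]), shift_size)
    for g, channels in enumerate(groups):
        offset = (g - shift_size // 2) * dilation
        if abs(offset) >= n:
            continue  # shifted entirely out of the map: the group stays zero
        src = [slice(None)] * x.ndim
        dst = [slice(None)] * x.ndim
        if offset >= 0:
            src[axis], dst[axis] = slice(0, n - offset), slice(offset, n)
        else:
            src[axis], dst[axis] = slice(-offset, n), slice(0, n + offset)
        block = x[channels]              # copy of this channel group
        shifted = np.zeros_like(block)
        shifted[tuple(dst)] = block[tuple(src)]
        out[channels] = shifted
    return out


def as_mlp_mix(x, w_h, w_v, shift_size=3, dilation=1):
    """Mix channels after horizontal and vertical shifts (illustrative only).

    w_h and w_v are (C_out, C_in) matrices applied independently at every
    pixel. Because each input channel now holds a value from a neighbouring
    position, the projection sees a local, cross-shaped neighbourhood.
    """
    h = axial_shift(x, shift_size, axis=2, dilation=dilation)
    v = axial_shift(x, shift_size, axis=1, dilation=dilation)
    return np.einsum("oc,chw->ohw", w_h, h) + np.einsum("oc,chw->ohw", w_v, v)


# Three channels, one row, four columns: each channel forms its own group.
x = np.arange(12, dtype=float).reshape(3, 1, 4)
print(axial_shift(x, shift_size=3, axis=2)[:, 0])
```

In this sketch, increasing `shift_size` widens the neighbourhood that the channel projection can see, and `dilation` spaces the sampled positions further apart, which mirrors how kernel size and dilation are chosen for a convolution.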

updated: Sun Jul 18 2021 08:56:34 GMT+0000 (UTC)

published: Sun Jul 18 2021 08:56:34 GMT+0000 (UTC)

arXiv
