MLP-Mixer: An all-MLP Architecture for Vision

Ilya Tolstikhin; Neil Houlsby; Alexander Kolesnikov; Lucas Beyer; Xiaohua Zhai; Thomas Unterthiner; Jessica Yung; Daniel Keysers; Jakob Uszkoreit; Mario Lucic; Alexey Dosovitskiy

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
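
The two layer types described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the activation, normalization, or residual connections, so the GELU activation, the parameter-free layer normalization, the skip connections, the initialization scale, and all function names and sizes below are assumptions made for illustration. The input is a table of shape (num_patches, channels): the token-mixing MLP acts along the patch axis and the channel-mixing MLP acts along the channel axis.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # two-layer perceptron applied to the last axis of x
    return gelu(x @ w1 + b1) @ w2 + b2

def layer_norm(x, eps=1e-6):
    # normalization over the channel axis, without learned scale/shift (assumption)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_mlp(rng, dim, hidden):
    # hypothetical helper: small random weights, zero biases
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels)
    # token mixing: transpose so the MLP acts across patches ("mixing" spatial information)
    y = layer_norm(x).T                      # (channels, num_patches)
    x = x + mlp(y, *token_params).T          # skip connection (assumption)
    # channel mixing: the same MLP is applied to every patch independently
    x = x + mlp(layer_norm(x), *channel_params)
    return x

# illustrative sizes only: 16 patches with 8 channels each
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
token_params = init_mlp(rng, dim=16, hidden=32)
channel_params = init_mlp(rng, dim=8, hidden=24)
out = mixer_block(x, token_params, channel_params)
print(out.shape)
```

Because both MLPs map back to their input width, the block preserves the (num_patches, channels) shape, so blocks of this kind can be stacked.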

updated: Tue May 04 2021 16:17:21 GMT+0000 (UTC)

published: Tue May 04 2021 16:17:21 GMT+0000 (UTC)

arXiv
