RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

Yuki Tatsunami; Masato Taki

For the past ten years, CNNs have reigned supreme in computer vision, but recently Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practice. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture built from MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of building a non-convolutional inductive bias into the architecture itself, and we do so using two simple ideas. The first is to divide the token-mixing block into vertical and horizontal mixing. The second is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, strikes a good balance among computational complexity, number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. A PyTorch implementation is available at https://github.com/okojoalg/raft-mlp.
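To make the first idea concrete, the following minimal sketch counts the parameters of a token-mixing MLP that mixes all patches at once against a pair of MLPs that mix along the vertical and horizontal axes separately. It is an illustration under stated assumptions, not the authors' implementation: the 14x14 patch grid, the expansion factor of 2, and the function names are hypothetical, the MLP weights are assumed to be shared across channels, and the second idea (denser spatial correlations among some channels) is not modeled.

```python
def mlp_params(dim: int, expansion: int) -> int:
    """Weights and biases of a two-layer MLP: dim -> expansion * dim -> dim."""
    hidden = expansion * dim
    return (dim * hidden + hidden) + (hidden * dim + dim)


def full_token_mixing_params(h: int, w: int, expansion: int) -> int:
    # MLP-Mixer style: a single MLP mixes all h * w tokens at once.
    return mlp_params(h * w, expansion)


def split_token_mixing_params(h: int, w: int, expansion: int) -> int:
    # Vertical/horizontal split: one MLP mixes along the height axis,
    # another mixes along the width axis.
    return mlp_params(h, expansion) + mlp_params(w, expansion)


if __name__ == "__main__":
    h = w = 14  # hypothetical patch grid (assumption, not from the abstract)
    e = 2       # hypothetical expansion factor (assumption)
    print("full token mixing :", full_token_mixing_params(h, w, e))
    print("split token mixing:", split_token_mixing_params(h, w, e))
```

Under these assumptions the full token-mixing MLP grows quadratically with the number of patches, while the split version grows quadratically only with the side length of the grid, which is consistent with the parameter reduction the abstract reports.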

updated: Mon Aug 09 2021 23:55:24 GMT+0000 (UTC)

published: Mon Aug 09 2021 23:55:24 GMT+0000 (UTC)

arXiv
