RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

Yuki Tatsunami; Masato Taki

For the past ten years, CNNs have reigned supreme in computer vision, but recently Transformers have been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture built from MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into MLP-Mixer while taking advantage of its ability to capture global correlations. The first is to divide the token-mixing block into vertical and horizontal mixing. The second is to make spatial correlations denser among some channels in token-mixing. With this approach, we improved the accuracy of MLP-Mixer while reducing its parameters and computational complexity. The small model, RaftMLP-S, is comparable to state-of-the-art global MLP-based models in terms of parameter count and efficiency per calculation. In addition, we tackled the problem of fixed input image resolution in global MLP-based models by using bicubic interpolation. We demonstrated that these models can be applied as the backbone of architectures for downstream tasks such as object detection. However, their performance on these tasks was not strong, which points to the need for MLP-specific architectures when global MLP-based models are used for downstream tasks. The PyTorch source code is available at https://github.com/okojoalg/raft-mlp.
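
The claim that splitting token mixing into vertical and horizontal steps reduces parameters can be checked with simple arithmetic. The sketch below is an illustration only, not the authors' implementation: it assumes a two-layer token-mixing MLP as in MLP-Mixer, and the 14x14 token grid, the expansion factor of 2, and all function names are hypothetical choices. It also ignores the second idea from the abstract (mixing several channels together), which would enlarge the per-axis MLPs.

```python
def mlp_params(dim: int, expansion: int = 2) -> int:
    """Weights plus biases of a two-layer MLP: dim -> expansion*dim -> dim."""
    hidden = expansion * dim
    first_layer = dim * hidden + hidden   # weight matrix + bias
    second_layer = hidden * dim + dim     # weight matrix + bias
    return first_layer + second_layer


def global_token_mixing_params(h: int, w: int, expansion: int = 2) -> int:
    """One MLP that mixes all h*w tokens at once (MLP-Mixer style)."""
    return mlp_params(h * w, expansion)


def split_token_mixing_params(h: int, w: int, expansion: int = 2) -> int:
    """One MLP along the vertical axis plus one along the horizontal axis."""
    return mlp_params(h, expansion) + mlp_params(w, expansion)


# Assumed example: a 14x14 grid of tokens (e.g. 224x224 image, 16x16 patches).
h = w = 14
global_count = global_token_mixing_params(h, w)
split_count = split_token_mixing_params(h, w)
print(f"global token mixing: {global_count} parameters")
print(f"vertical + horizontal: {split_count} parameters")
```

Under these assumptions the global MLP scales with (h*w)^2 while the split version scales with h^2 + w^2, which is where the saving comes from. Each split step still spans a full row or column, so the combination can propagate information across the whole image.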

updated: Tue Nov 23 2021 06:59:50 GMT+0000 (UTC)

published: Mon Aug 09 2021 23:55:24 GMT+0000 (UTC)

arXiv
