Patches Are All You Need?

Asher Trockman; J. Zico Kolter

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
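The abstract's point about quadratic self-attention cost can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the 224-pixel image side and the patch sizes tried are assumptions, and `num_patches` and `attention_pairs` are hypothetical helper names. It counts how many tokens a square image yields for a given patch size and how many entries a single self-attention score matrix would then contain.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Entries in one self-attention score matrix: every token attends to every token."""
    return num_tokens * num_tokens


# Assumed setting (not from the abstract): a 224x224 image and a few patch sizes.
# A patch size of 1 corresponds to treating every pixel as its own token.
for patch_size in (1, 7, 14, 16):
    tokens = num_patches(224, patch_size)
    print(f"patch={patch_size:>2}  tokens={tokens:>6}  score entries={attention_pairs(tokens):,}")
```

Under these assumed sizes, grouping pixels into 16x16 patches cuts the token count from 50,176 to 196, and the score matrix shrinks by the square of that ratio. This is why patch embeddings make self-attention tractable on larger images.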

updated: Mon Jan 24 2022 16:42:56 GMT+0000 (UTC)

published: Mon Jan 24 2022 16:42:56 GMT+0000 (UTC)

arXiv

