Vision Xformers: Efficient Attention for Image Classification

Pranav Jeevan; Amit Sethi

We propose three improvements to vision transformers (ViT) that reduce the number of trainable parameters without compromising classification accuracy. We address two shortcomings of the early ViT architectures: the quadratic bottleneck of the attention mechanism and the lack of an inductive bias in architectures that rely on unrolling the two-dimensional image structure. Linear attention mechanisms overcome the bottleneck of quadratic complexity, which restricts the application of transformer models in vision tasks. We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient, linear-complexity transformers such as Performer, Linformer and Nyströmformer, creating Vision X-formers (ViX). We show that all three versions of ViX may be more accurate than ViT for image classification while using far fewer parameters and computational resources. We also compare their performance with FNet and the multi-layer perceptron (MLP) mixer. We further show that replacing the initial linear embedding layer with convolutional layers in ViX increases their performance. Furthermore, our tests on recent vision transformer models, such as LeViT, Convolutional vision Transformer (CvT), Compact Convolutional Transformer (CCT) and Pooling-based Vision Transformer (PiT), show that replacing the attention with Nyströmformer or Performer reduces GPU usage and memory consumption without degrading classification accuracy. We also show that replacing the standard learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE) gives further improvements in accuracy. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
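To make the quadratic bottleneck concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It contrasts standard softmax attention, whose score matrix has n x n entries for n tokens, with a Linformer-style variant that first projects the keys and values down to r rows, so the score matrix has only n x r entries. The sizes `n`, `d`, `r`, the helper names, and the random matrices `e` and `f` (stand-ins for projections that would be learned in a real model) are all assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e_val / total for e_val in exps])
    return out


def scaled_scores(q, k):
    """Compute q @ k^T / sqrt(d)."""
    d = len(q[0])
    return [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]


def full_attention(q, k, v):
    """Standard attention: the score matrix is n x n."""
    weights = softmax_rows(scaled_scores(q, k))
    return matmul(weights, v), (len(weights), len(weights[0]))


def linformer_style_attention(q, k, v, e, f):
    """Linformer-style attention: keys and values shrink from n rows to r rows."""
    k_proj = matmul(e, k)  # (r x n) @ (n x d) -> r x d
    v_proj = matmul(f, v)  # (r x n) @ (n x d) -> r x d
    weights = softmax_rows(scaled_scores(q, k_proj))  # n x r
    return matmul(weights, v_proj), (len(weights), len(weights[0]))


def random_matrix(rows, cols):
    """Hypothetical helper: Gaussian entries standing in for real activations."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d, r = 16, 4, 4  # illustrative sizes: tokens, feature width, projected length
q, k, v = random_matrix(n, d), random_matrix(n, d), random_matrix(n, d)
e, f = random_matrix(r, n), random_matrix(r, n)  # random stand-ins for learned projections

out_full, shape_full = full_attention(q, k, v)
out_lin, shape_lin = linformer_style_attention(q, k, v, e, f)
print("full attention score matrix:", shape_full)
print("Linformer-style score matrix:", shape_lin)
```

Both variants return one d-dimensional output per token, but the number of attention scores grows as n squared in the first case and as n times r in the second. With r held fixed, the cost grows linearly in the number of image patches.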

updated: Tue Aug 03 2021 03:58:35 GMT+0000 (UTC)

published: Mon Jul 05 2021 19:24:23 GMT+0000 (UTC)

arXiv
