Vision Xformers: Efficient Attention for Image Classification

Pranav Jeevan; Amit Sethi

Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computation to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Moreover, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. First, we alleviated the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X is one of Performer, Linformer, or Nyströmformer), thereby creating Vision X-formers (ViXs). This resulted in up to a sevenfold reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Second, we introduced an inductive bias for images by replacing the initial linear embedding layer with convolutional layers in ViX, which significantly increased classification accuracy without increasing the model size. Third, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increased the classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
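To make the quadratic bottleneck concrete, the following is a rough counting sketch and not the authors' code. It compares the number of attention-score entries per head for standard attention (n x n) with a Linformer-style variant that projects keys and values down to k rows (n x k). The image sizes, patch sizes, and the projection length k = 64 are illustrative assumptions, not values taken from the paper, and the counts are not the measured GPU memory figures reported above.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length when a square image is unrolled into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    per_side = image_size // patch_size
    return per_side * per_side


def full_attention_entries(n: int) -> int:
    """Entries in the n x n score matrix of standard softmax attention."""
    return n * n


def projected_attention_entries(n: int, k: int) -> int:
    """Entries in the n x k score matrix when keys and values are
    projected down to k rows (the Linformer-style idea)."""
    return n * k


if __name__ == "__main__":
    k = 64  # hypothetical projection length, for illustration only
    # (image_size, patch_size) pairs are assumptions chosen for the example
    for image_size, patch_size in [(32, 4), (32, 1), (64, 1)]:
        n = num_tokens(image_size, patch_size)
        full = full_attention_entries(n)
        proj = projected_attention_entries(n, k)
        print(f"{image_size}x{image_size}, patch {patch_size}: "
              f"n={n}, full={full}, projected={proj}")
```

Under these assumptions, doubling the image side from 32 to 64 pixels at patch size 1 quadruples the sequence length (1024 to 4096 tokens), which multiplies the full score matrix by 16 but the projected one by only 4. Actual memory use also depends on batch size, head count, and activations elsewhere in the network.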

updated: Fri Oct 01 2021 15:08:54 GMT+0000 (UTC)

published: Mon Jul 05 2021 19:24:23 GMT+0000 (UTC)

arXiv
