Incorporating Convolution Designs into Visual Transformers

Kun Yuan; Shaopeng Guo; Ziwei Liu; Aojun Zhou; Fengwei Yu; Wei Wu

Motivated by the success of Transformers in natural language processing (NLP) tasks, several attempts (e.g., ViT and DeiT) have emerged to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to reach performance comparable to convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks of directly borrowing Transformer architectures from NLP. We then propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of straightforward tokenization from raw input images, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. In addition, CeiT models demonstrate better convergence with 3× fewer training iterations, which can reduce the training cost significantly. Code and models will be released upon acceptance.
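
The abstract states only that LeFF promotes correlation among spatially neighboring tokens; it does not give the layer's internals. As a rough illustration of that idea, the following is a minimal sketch that assumes the patch tokens are laid back out on a 2D grid, mixed with a per-channel (depthwise) 3×3 filter with zero padding, and then flattened into a sequence again. The function names, the kernel size, and the padding choice are all hypothetical and are not taken from the source.

```python
def tokens_to_grid(tokens, height, width):
    """Rearrange a flat, row-major list of patch tokens into a height x width grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def depthwise_conv3x3(grid, kernel):
    """Apply one 3x3 kernel to every channel independently (assumed zero padding)."""
    height, width = len(grid), len(grid[0])
    channels = len(grid[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Neighbors outside the grid contribute nothing (zero padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        weight = kernel[dr + 1][dc + 1]
                        for ch in range(channels):
                            out[r][c][ch] += weight * grid[rr][cc][ch]
    return out


def local_token_mixing(tokens, height, width, kernel):
    """Sequence -> 2D grid -> depthwise 3x3 filter -> sequence (illustrative only)."""
    grid = tokens_to_grid(tokens, height, width)
    mixed = depthwise_conv3x3(grid, kernel)
    return [token for row in mixed for token in row]


if __name__ == "__main__":
    # Four single-channel tokens forming a 2 x 2 grid.
    tokens = [[1.0], [2.0], [3.0], [4.0]]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(local_token_mixing(tokens, 2, 2, ones))
```

With an all-ones kernel each output token becomes the sum of its spatial neighbors, so a token's new value depends on adjacent patches rather than on itself alone. A learned kernel, and any linear projections around this step, would replace the fixed weights in a real model.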

updated: Mon Mar 22 2021 13:16:12 GMT+0000 (UTC)

published: Mon Mar 22 2021 13:16:12 GMT+0000 (UTC)

arXiv
