ConvFormer: Closing the Gap Between CNN and Vision Transformers

Zimian Wei; Hengyue Pan; Xin Niu; Dongsheng Li

Vision transformers have shown excellent performance in computer vision tasks. However, their (local) self-attention mechanism is computationally expensive. In comparison, CNNs are more efficient thanks to their built-in inductive bias. Recent work shows that CNNs can compete with vision transformers by adopting their architecture design and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns in input images using multiple kernel sizes and enables input-adaptive weights through a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers while replacing the (local) self-attention mechanism with the proposed MCA. Extensive experimental results demonstrate that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks. For example, ConvFormer-S and ConvFormer-L achieve state-of-the-art top-1 accuracy of 82.8% and 83.6%, respectively, on the ImageNet dataset. Moreover, ConvFormer-S outperforms Swin-T by 1.5 mIoU on ADE20K and by 0.9 bounding box AP on COCO, with a smaller model size. Code and models will be made available.
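
The abstract describes MCA only at a high level: several kernel sizes capture different patterns, and a gate makes the weights depend on the input. The toy below is a minimal sketch of that general idea on a 1D signal in plain Python. It is not the authors' implementation. The function names, the fixed averaging kernels (standing in for learned weights), the summation of branches, and the sigmoid gate are all assumptions made for illustration.

```python
import math


def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_kernel_gated_attention(signal, kernels):
    """Hypothetical toy, not the paper's MCA: multi-size conv branches form a gate on the input."""
    # One branch per kernel size, each seeing a different neighbourhood width.
    branches = [conv1d_same(signal, k) for k in kernels]
    # Assumption: branches are combined by a plain element-wise sum.
    mixed = [sum(vals) for vals in zip(*branches)]
    # Assumption: a sigmoid turns the mixed response into a gate in (0, 1).
    gate = [sigmoid(m) for m in mixed]
    # The gate depends on the input, so the effective weights are input-adaptive.
    return [g * x for g, x in zip(gate, signal)]


# Fixed averaging kernels of sizes 1, 3 and 5 stand in for learned weights (assumption).
kernels = [[1.0 / k] * k for k in (1, 3, 5)]
signal = [0.0, 1.0, 0.0, -1.0, 2.0, 0.0]
output = multi_kernel_gated_attention(signal, kernels)
```

Because the gate is computed from the signal itself, two different inputs are scaled by different weights even though the kernels are fixed, which is the sense of "input-adaptive" used in this sketch.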

updated: Fri Sep 16 2022 06:45:01 GMT+0000 (UTC)

published: Fri Sep 16 2022 06:45:01 GMT+0000 (UTC)

arXiv
