ConvFormer: Closing the Gap Between CNN and Vision Transformers

Zimian Wei; Hengyue Pan; Xin Niu; Dongsheng Li

Vision transformers have shown excellent performance in computer vision tasks. However, their (local) self-attention mechanism is computationally expensive. By comparison, CNNs are more efficient thanks to their built-in inductive bias. Recent works show that CNNs can compete with vision transformers by adopting their architecture design and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns in input images using multiple kernel sizes and enables input-adaptive weights through a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers while replacing the (local) self-attention mechanism with our proposed MCA. Extensive experimental results demonstrate that ConvFormer achieves state-of-the-art performance on ImageNet classification, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs). Moreover, for object detection on COCO and semantic segmentation on ADE20K, ConvFormer also shows excellent performance compared with recent advanced methods. Code and models will be made available.
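
The abstract describes MCA only at a high level: several kernel sizes capture patterns at different scales, and a gate makes the weighting depend on the input. The sketch below is a toy illustration of that general idea, not the authors' MCA. It is plain Python on a single-channel image, with fixed box (averaging) filters standing in for learned convolutions. The kernel sizes (3, 5, 7), the summation of branches, the sigmoid gate, and the function names are all assumptions made for illustration.

```python
import math


def box_filter(image, k):
    """Average over a k x k window with zero padding.

    Assumption: a fixed averaging filter stands in for a learned convolution.
    """
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x]
            out[i][j] = total / (k * k)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def multi_kernel_gated(image, kernel_sizes=(3, 5, 7)):
    """Combine multi-scale context into an input-dependent gate.

    Hypothetical design: branch outputs are summed, squashed by a sigmoid,
    and used to rescale the input pixel-wise.
    """
    branches = [box_filter(image, k) for k in kernel_sizes]
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            context = sum(b[i][j] for b in branches)  # multi-scale context
            gate = sigmoid(context)                   # input-adaptive weight
            out[i][j] = gate * image[i][j]            # gated value
    return out


if __name__ == "__main__":
    img = [[0.0] * 5 for _ in range(5)]
    img[2][2] = 1.0  # a single bright pixel at the center
    result = multi_kernel_gated(img)
    print(result[2][2])
```

Because the gate is computed from the input itself, the effective weight applied at each position changes with the image content, which is the input-adaptive property the abstract contrasts with static convolution weights. A trained model would use learned, multi-channel convolutions rather than these fixed filters.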

updated: Thu Sep 22 2022 06:11:12 GMT+0000 (UTC)

published: Fri Sep 16 2022 06:45:01 GMT+0000 (UTC)

arXiv
