CoAtNet: Marrying Convolution and Attention for All Data Sizes

Zihang Dai; Hanxiao Liu; Quoc V. Le; Mingxing Tan

Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than that of convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths of both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; when pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
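
Insight (1) in the abstract can be illustrated with a small sketch. The abstract does not spell out the formulation, so the following is only one plausible reading, under stated assumptions: a 1-D token sequence, attention logits formed as the sum of an input-dependent term `x_i · x_j` and a learned scalar `w[i - j]` that depends only on the relative offset (the role a depthwise convolution kernel plays). The function name `relative_attention_1d` and the shapes are hypothetical and chosen for illustration, not taken from the authors' code.

```python
import numpy as np

def relative_attention_1d(x, w):
    """Sketch of attention with a relative-offset bias (assumed formulation).

    x: (L, D) array of token features.
    w: (2L - 1,) array, one scalar per relative offset i - j,
       stored so that offset 0 sits at index L - 1.
    """
    L = x.shape[0]
    logits = x @ x.T                        # input-dependent term: x_i . x_j
    idx = np.arange(L)
    offsets = idx[:, None] - idx[None, :]   # i - j, ranging over [-(L-1), L-1]
    logits = logits + w[offsets + L - 1]    # static term shared across positions
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w = np.zeros(2 * 5 - 1)   # zero bias reduces this to plain softmax self-attention
out = relative_attention_1d(x, w)
```

With `w` set to zero the function is ordinary softmax self-attention; when `w` dominates the logits, the weights depend almost only on the offset `i - j`, which is the convolution-like behaviour the abstract alludes to.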

updated: Wed Sep 15 2021 06:05:13 GMT+0000 (UTC)

published: Wed Jun 09 2021 04:35:31 GMT+0000 (UTC)

arXiv
