A ConvNet for the 2020s

Zhuang Liu; Hanzi Mao; Chao-Yuan Wu; Christoph Feichtenhofer; Trevor Darrell; Saining Xie

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

updated: Mon Jan 10 2022 18:59:10 GMT+0000 (UTC)

published: Mon Jan 10 2022 18:59:10 GMT+0000 (UTC)

arXiv
