Swin Transformer V2: Scaling Up Capacity and Resolution

Ze Liu; Han Hu; Yutong Lin; Zhuliang Yao; Zhenda Xie; Yixuan Wei; Jia Ning; Yue Cao; Zheng Zhang; Li Dong; Furu Wei; Baining Guo

Large-scale NLP models have been shown to significantly improve performance on language tasks with no signs of saturation. They also demonstrate remarkable few-shot capabilities comparable to those of humans. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and applying large vision models: training instability, resolution gaps between pre-training and fine-tuning, and hunger for labeled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images. Using these techniques, we successfully trained a 3-billion-parameter Swin Transformer V2 model, the largest dense vision model to date, and made it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Our training is also much more efficient than that of Google's billion-level visual models, consuming 40 times less labeled data and 40 times less training time. Code is available at https://github.com/microsoft/Swin-Transformer.
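
The abstract names cosine attention and a log-spaced position bias without giving formulas. The following is a minimal NumPy sketch of both ideas, not the released implementation. It assumes a single attention head, that attention logits are the cosine similarity between queries and keys divided by a temperature `tau`, and that relative offsets are remapped as sign(Δ)·log(1 + |Δ|). The function names, the `eps` guard, and the fixed `tau=0.1` are illustrative assumptions; in a real model the temperature would be learned and the bias would come from a small network applied to the remapped offsets.

```python
import numpy as np

def cosine_attention(q, k, v, tau=0.1, bias=None, eps=1e-12):
    """Single-head attention using cosine similarity instead of a dot product.

    q, k, v: arrays of shape (n_tokens, dim).
    tau: temperature (assumed fixed here; hypothetical default).
    bias: optional (n_tokens, n_tokens) position bias added to the logits.
    """
    # Normalize rows so q @ k.T becomes cosine similarity in [-1, 1].
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = (qn @ kn.T) / tau
    if bias is not None:
        logits = logits + bias
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def log_spaced_coords(delta):
    """Map linear relative offsets to log-spaced ones: sign(d) * log(1 + |d|)."""
    delta = np.asarray(delta, dtype=float)
    return np.sign(delta) * np.log1p(np.abs(delta))

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = cosine_attention(q, k, v)
# Rescaling the queries leaves the output unchanged, because only the
# direction of each vector enters the similarity.
out_scaled = cosine_attention(100.0 * q, k, v)
coords = log_spaced_coords([-15, -7, 0, 7, 15])
```

Two properties are visible in this sketch. Because the logits are bounded by 1/`tau` regardless of activation magnitude, a few large activations cannot dominate the softmax. Under the log mapping, offsets of ±7 and ±15 become roughly ±2.08 and ±2.77, so moving to a larger window requires much less extrapolation of the position bias than with linear coordinates.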

updated: Mon Apr 11 2022 16:03:17 GMT+0000 (UTC)

published: Thu Nov 18 2021 18:59:33 GMT+0000 (UTC)

arXiv

