CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong; Jianmin Bao; Dongdong Chen; Weiming Zhang; Nenghai Yu; Lu Yuan; Dong Chen; Baining Guo

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel; together the two stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under similar FLOPs settings. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.7 mIoU. The code and models will be available at https://github.com/microsoft/CSWin-Transformer.
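To make the stripe partition concrete, the following is a minimal sketch in plain Python of which tokens fall inside the cross-shaped window of a single token. It is not taken from the released code: the function name `cross_window`, the grid size, and the stripe width `sw` are hypothetical choices for illustration, and the sketch only enumerates token positions rather than computing attention.

```python
def cross_window(i, j, height, width, sw):
    """Token positions in the horizontal and vertical stripes containing (i, j).

    Assumption for this sketch: stripes have equal width `sw` and tile the
    feature map starting from row 0 / column 0.
    """
    r0 = (i // sw) * sw  # first row of the horizontal stripe holding row i
    c0 = (j // sw) * sw  # first column of the vertical stripe holding column j
    horizontal = {(r, c)
                  for r in range(r0, min(r0 + sw, height))
                  for c in range(width)}
    vertical = {(r, c)
                for r in range(height)
                for c in range(c0, min(c0 + sw, width))}
    return horizontal, vertical


# Hypothetical 8x8 feature map with stripe width 2.
H, W, sw = 8, 8, 2
horizontal, vertical = cross_window(3, 5, H, W, sw)

# Tokens reachable through each stripe, through their union (the cross),
# and through global attention over the whole map.
print(len(horizontal), len(vertical), len(horizontal | vertical), H * W)
# prints: 16 16 28 64
```

In this toy setting each stripe holds `sw * W` or `sw * H` tokens, so the number of tokens a query attends to grows linearly with the stripe width; when `sw` equals the side length of the map, each stripe covers the whole map and the window reduces to global attention. This is the trade-off between modeling capability and computation cost that the stripe width controls.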

updated: Thu Jul 15 2021 17:59:49 GMT+0000 (UTC)

published: Thu Jul 01 2021 17:59:56 GMT+0000 (UTC)

arXiv
