Global Context Vision Transformers

Ali Hatamizadeh; Hongxu Yin; Jan Kautz; Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, jointly with local self-attention, to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by using modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.2%, 83.9% and 84.4% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets outperform prior work consistently, sometimes by large margins. Code is available at https://github.com/NVlabs/GCViT.
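To make the pairing of local and global context self-attention more concrete, here is a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight matrices `wq`, `wk`, `wv`, and the use of average pooling to form the global query are all hypothetical stand-ins (the abstract does not say how global queries are produced; see the linked repository for the actual code). The idea it shows is that local attention takes queries, keys and values from the same window, while global context attention reuses one image-wide query for every window, so no attention masks or window shifts are involved.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, ws):
    """Split an (H, W, C) map into non-overlapping windows: (num_windows, ws*ws, C)."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def attention(q, k, v):
    """Scaled dot-product attention over batches of token sequences."""
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.transpose(0, 2, 1) * scale) @ v

def local_window_attention(x, ws, wq, wk, wv):
    """Queries, keys and values all come from the same local window."""
    win = window_partition(x, ws)
    return attention(win @ wq, win @ wk, win @ wv)

def global_query(x, ws, wq):
    """Assumption: average-pool the whole map down to ws x ws query tokens."""
    H, W, C = x.shape
    pooled = x.reshape(ws, H // ws, ws, W // ws, C).mean(axis=(1, 3))
    return pooled.reshape(1, ws * ws, C) @ wq

def global_context_attention(x, ws, wq, wk, wv):
    """One shared global query attends to each window's local keys and values."""
    win = window_partition(x, ws)
    q = global_query(x, ws, wq)
    q = np.broadcast_to(q, (win.shape[0],) + q.shape[1:])  # same query for all windows
    return attention(q, win @ wk, win @ wv)
```

Both functions return one output token per window position, so in a sketch like this the two kinds of attention could be alternated layer by layer without changing tensor shapes. The height and width of the input are assumed to be divisible by the window size `ws`.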

updated: Mon Jun 20 2022 18:42:44 GMT+0000 (UTC)

published: Mon Jun 20 2022 18:42:44 GMT+0000 (UTC)

arXiv
