Global Context Vision Transformers

Ali Hatamizadeh; Hongxu Yin; Jan Kautz; Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, jointly with local self-attention, to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by using modified fused inverted residual blocks in our architecture. GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small, and base variants of GC ViT, with 28M, 51M, and 90M parameters, achieve 83.3%, 83.9%, and 84.5% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins. Code is available at https://github.com/NVlabs/GCViT.
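To make the pairing of local and global context self-attention more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a 1-D token sequence whose length is a multiple of the window size, omits learned projections and multiple heads, and builds the shared global queries by averaging tokens across windows. That pooling step is a hypothetical stand-in chosen for illustration, since the abstract does not say how global queries are produced. All function names are illustrative.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention on lists of vectors (no learned weights).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def partition_windows(tokens, window):
    # Fixed, non-overlapping windows: no shifting and no attention masks.
    # Assumes len(tokens) is a multiple of window.
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def local_attention(tokens, window):
    # Short-range interactions: each window attends only to itself.
    out = []
    for win in partition_windows(tokens, window):
        out.extend(attention(win, win, win))
    return out

def global_context_attention(tokens, window):
    # Long-range interactions: one set of queries summarising the whole
    # input is shared by every window and attends to that window's
    # keys and values.
    # ASSUMPTION: global queries are the per-position mean across windows;
    # this pooling is a placeholder, not the method from the paper.
    windows = partition_windows(tokens, window)
    dim = len(tokens[0])
    global_q = [[sum(w[p][j] for w in windows) / len(windows) for j in range(dim)]
                for p in range(window)]
    out = []
    for win in windows:
        out.extend(attention(global_q, win, win))
    return out

if __name__ == "__main__":
    toy_tokens = [[float(i), float(i % 3)] for i in range(8)]
    mixed = global_context_attention(local_attention(toy_tokens, 4), 4)
    print(len(mixed), len(mixed[0]))
```

Because the windows stay fixed and the global queries carry information from every window, this toy version needs neither window shifting nor masking to mix information across windows, which is the property the abstract highlights.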

updated: Wed Sep 07 2022 21:02:00 GMT+0000 (UTC)

published: Mon Jun 20 2022 18:42:44 GMT+0000 (UTC)

arXiv
