Global Context Vision Transformers

Ali Hatamizadeh; Hongxu Yin; Jan Kautz; Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. At the core of the model are global context self-attention modules which, jointly with standard local self-attention, effectively yet efficiently model both long- and short-range spatial interactions, as an alternative to complex operations such as attention masks or local window shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules and interact with local keys and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler that leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small, and base variants of GC ViT, with 28M, 51M, and 90M parameters, achieve 83.4%, 83.9%, and 84.4% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins. Code and pre-trained models are available at https://github.com/NVlabs/GCViT.
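The interaction between shared global query tokens and local keys and values can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a single attention head, that the feature map has already been split into non-overlapping windows, and that the global query tokens have already been computed and shaped to match one window. The function name `global_query_window_attention` and the projection matrices `w_k` and `w_v` are hypothetical names chosen for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def global_query_window_attention(windows, q_global, w_k, w_v):
    """Single-head attention with one shared query set (illustrative only).

    windows:  (num_windows, n, d) local tokens, one row of n tokens per window
    q_global: (n, d) global query tokens, reused unchanged for every window
    w_k, w_v: (d, d) hypothetical key and value projection matrices
    """
    d = windows.shape[-1]
    k = windows @ w_k  # local keys,   (num_windows, n, d)
    v = windows @ w_v  # local values, (num_windows, n, d)
    # The same global queries are broadcast over the window axis.
    scores = q_global[None] @ k.transpose(0, 2, 1) / np.sqrt(d)
    attn = softmax(scores, axis=-1)  # (num_windows, n, n)
    return attn @ v, attn


rng = np.random.default_rng(0)
num_windows, n, d = 4, 49, 32  # e.g. four 7x7 windows, 32 channels (assumed sizes)
windows = rng.standard_normal((num_windows, n, d))
q_global = rng.standard_normal((n, d))
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)

out, attn = global_query_window_attention(windows, q_global, w_k, w_v)
print(out.shape, attn.shape)
```

In standard local window attention the queries would also be projected from each window's own tokens. Here only the keys and values are local, so every window is queried with the same globally derived tokens, which is one way to read the abstract's description of how long-range context reaches each window without window shifting.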

updated: Sat Oct 01 2022 03:40:57 GMT+0000 (UTC)

published: Mon Jun 20 2022 18:42:44 GMT+0000 (UTC)

arXiv
