Global Context Vision Transformers

Ali Hatamizadeh; Hongxu Yin; Greg Heinrich; Jan Kautz; Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, at an image resolution of 224 and without any pre-training, hence surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
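The abstract does not spell out how the global and local attention are combined, so the following is only a toy sketch under stated assumptions, not the authors' implementation. It assumes that "global context" queries are pooled from the whole token sequence and reused in every non-overlapping window, while keys and values stay local to each window. The 1-D token layout, the average pooling, and all function names (`attend`, `pooled_global_queries`, and so on) are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return outputs


def split_windows(tokens, window):
    """Split a 1-D token sequence into non-overlapping windows."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def local_attention(tokens, window):
    """Each window attends only to itself: no attention masks, no window shifting."""
    out = []
    for win in split_windows(tokens, window):
        out.extend(attend(win, win, win))
    return out


def pooled_global_queries(tokens, window):
    """Assumption: average-pool the whole sequence down to `window` query tokens."""
    chunk = len(tokens) // window
    pooled = []
    for i in range(window):
        group = tokens[i * chunk:(i + 1) * chunk]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def global_query_attention(tokens, window, global_queries):
    """Assumption: the same global queries are reused against each window's keys/values."""
    out = []
    for win in split_windows(tokens, window):
        out.extend(attend(global_queries, win, win))
    return out


# Toy usage: 8 tokens of dimension 2, window size 4 (sequence length must divide evenly).
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0],
          [0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
queries = pooled_global_queries(tokens, window=4)
short_range = local_attention(tokens, window=4)
long_range = global_query_attention(tokens, window=4, global_queries=queries)
```

In this sketch the short-range path sees only tokens inside a window, whereas the long-range path injects information from the whole sequence through the shared queries, which is one way to get cross-window interaction without masks or shifted windows.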

updated: Tue Jun 06 2023 08:17:18 GMT+0000 (UTC)

published: Mon Jun 20 2022 18:42:44 GMT+0000 (UTC)

arXiv
