MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu; Hossein Talebi; Han Zhang; Feng Yang; Peyman Milanfar; Alan Bovik; Yinxiao Li

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
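The abstract names two token groupings, blocked local and dilated global, without spelling out how they are formed. The sketch below is one plausible reading, not the authors' implementation. It assumes the local grouping uses non-overlapping square windows and the global grouping uses tokens sampled at a uniform stride across the whole feature map. The function names `block_partition`, `grid_partition`, and `attention_pairs`, and the size parameters `window` and `grid`, are hypothetical and chosen only for illustration.

```python
def block_partition(height, width, window):
    """Group token coordinates into non-overlapping window x window blocks.

    Assumption: this models the "blocked local" axis. Every group holds
    spatially adjacent tokens.
    """
    assert height % window == 0 and width % window == 0
    groups = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            groups.append([(top + r, left + c)
                           for r in range(window) for c in range(window)])
    return groups


def grid_partition(height, width, grid):
    """Group token coordinates into grid x grid sets sampled at a fixed stride.

    Assumption: this models the "dilated global" axis. Every group holds
    tokens spread evenly over the entire feature map.
    """
    assert height % grid == 0 and width % grid == 0
    stride_r, stride_c = height // grid, width // grid
    groups = []
    for off_r in range(stride_r):
        for off_c in range(stride_c):
            groups.append([(off_r + r * stride_r, off_c + c * stride_c)
                           for r in range(grid) for c in range(grid)])
    return groups


def attention_pairs(groups):
    """Count query-key pairs when attention runs only inside each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    for side in (8, 16, 32):
        tokens = side * side
        local = attention_pairs(block_partition(side, side, 4))
        dilated = attention_pairs(grid_partition(side, side, 4))
        print(f"{side}x{side}: full={tokens ** 2} local={local} global={dilated}")
```

With a fixed group size, the number of query-key pairs in either grouping grows in proportion to the number of tokens, whereas full self-attention grows with its square. This is consistent with the linear-complexity claim in the abstract, under the assumptions stated above.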

updated: Sun Jul 24 2022 05:35:39 GMT+0000 (UTC)

published: Mon Apr 04 2022 17:59:44 GMT+0000 (UTC)

arXiv

