VOLO: Vision Outlooker for Visual Recognition

Li Yuan; Qibin Hou; Zihang Jiang; Jiashi Feng; Shuicheng Yan

Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Although the prevailing vision transformers (ViTs) have recently shown the great potential of self-attention-based models in ImageNet classification, their performance is still inferior to that of the latest state-of-the-art CNNs when no extra data are provided. In this work, we try to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We find that a major factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention, which focuses on global dependency modeling at a coarse level, outlook attention efficiently encodes finer-level features and contexts into tokens, which is shown to be critically beneficial to recognition performance but is largely ignored by self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, making it the first model to exceed 87% accuracy on this competitive benchmark without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks such as semantic segmentation. We achieve an mIoU score of 84.3% on the Cityscapes validation set and 54.3% on the ADE20K validation set. Code is available at https://github.com/sail-sg/volo.
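
The abstract does not spell out how outlook attention works, so the following is only a minimal NumPy sketch of the idea as it is commonly described for VOLO: each token predicts, through a linear layer, the attention weights for its own K x K neighbourhood, mixes the projected values in that window, and the overlapping windows are summed back onto the feature map. The function name `outlook_attention`, the weight matrices `w_v` and `w_a`, the single head, the stride of 1, and the omission of any scaling factor or output projection are all simplifying assumptions made for illustration; this is not the authors' implementation, which lives in the repository linked above.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention(x, w_v, w_a, k=3):
    """Single-head, stride-1 sketch (hypothetical helper, odd k assumed).

    x   : (H, W, C) token feature map
    w_v : (C, C) value projection
    w_a : (C, k**4) projection that directly predicts attention weights
    """
    h, w, c = x.shape
    p = k // 2
    v = x @ w_v                                   # (H, W, C) values
    # Each token predicts a (k*k, k*k) attention matrix for its window.
    a = softmax((x @ w_a).reshape(h, w, k * k, k * k), axis=-1)
    v_pad = np.pad(v, ((p, p), (p, p), (0, 0)))   # zero padding on borders
    y_pad = np.zeros_like(v_pad)
    for i in range(h):
        for j in range(w):
            # "Unfold": gather the k x k window of values centred at (i, j).
            window = v_pad[i:i + k, j:j + k].reshape(k * k, c)
            mixed = a[i, j] @ window              # (k*k, C) mixed values
            # "Fold": sum the mixed window back onto its positions.
            y_pad[i:i + k, j:j + k] += mixed.reshape(k, k, c)
    return y_pad[p:p + h, p:p + w]                # (H, W, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 4))
y = outlook_attention(x, rng.standard_normal((4, 4)),
                      rng.standard_normal((4, 81)), k=3)
print(y.shape)
```

Note that, unlike self-attention, no query-key dot product appears in this sketch: the attention weights come straight from a linear map of the centre token, which is what keeps the operation local and comparatively cheap.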

updated: Mon Jun 28 2021 14:40:33 GMT+0000 (UTC)

published: Thu Jun 24 2021 15:46:54 GMT+0000 (UTC)

arXiv
