Vision Transformer with Progressive Sampling

Xiaoyu Yue; Shuyang Sun; Zhanghui Kuang; Meng Wei; Philip Torr; Wayne Zhang; Dahua Lin

Transformers with powerful global relation modeling abilities have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) applies a pure transformer architecture directly to image classification by simply splitting images into tokens with a fixed length and employing transformers to learn the relations between these tokens. However, such naive tokenization could destroy object structures, assign grids to regions of no interest such as the background, and introduce interference signals. To mitigate these issues, we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, the embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive sampling is differentiable. When combined with the Vision Transformer, the resulting PS-ViT network can adaptively learn where to look. The proposed PS-ViT is both effective and efficient. When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT with about 4× fewer parameters and 10× fewer FLOPs. Code is available at https://github.com/yuexy/PS-ViT.
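
The sampling loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes a single-channel feature map stored as a list of rows, bilinear interpolation as the differentiable sampling operator, and clamping of locations to the map boundary. The names `bilinear_sample`, `initial_grid`, `progressive_sampling`, and `make_toy_predictor` are hypothetical, and the toy predictor merely stands in for the transformer encoder layer and offset prediction, which are learned in the real model.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D grid (list of rows) at fractional (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    # Assumption: out-of-range locations are clamped to the map boundary.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def initial_grid(h, w, n):
    """Start from n x n regularly spaced sampling locations (cell centers)."""
    sy, sx = h / n, w / n
    return [((i + 0.5) * sy - 0.5, (j + 0.5) * sx - 0.5)
            for i in range(n) for j in range(n)]


def progressive_sampling(feature_map, n, num_iters, predict_offsets):
    """Sample tokens, predict offsets, move the locations, and repeat.

    predict_offsets(tokens, points) is a hypothetical stand-in for the
    transformer encoder layer plus offset prediction named in the abstract.
    """
    h, w = len(feature_map), len(feature_map[0])
    points = initial_grid(h, w, n)
    tokens = []
    for step in range(num_iters):
        # Tokens for this step are sampled at the current locations.
        tokens = [bilinear_sample(feature_map, y, x) for y, x in points]
        if step < num_iters - 1:
            offsets = predict_offsets(tokens, points)
            # Update the sampling locations for the next step.
            points = [(min(max(y + oy, 0.0), h - 1.0),
                       min(max(x + ox, 0.0), w - 1.0))
                      for (y, x), (oy, ox) in zip(points, offsets)]
    return tokens, points


def make_toy_predictor(feature_map, lr=0.25):
    """Toy predictor: step along a finite-difference gradient of the map.

    It ignores the tokens; a learned model would predict offsets from them.
    """
    def predict(tokens, points):
        offsets = []
        for y, x in points:
            gy = (bilinear_sample(feature_map, y + 1, x)
                  - bilinear_sample(feature_map, y - 1, x)) / 2
            gx = (bilinear_sample(feature_map, y, x + 1)
                  - bilinear_sample(feature_map, y, x - 1)) / 2
            offsets.append((lr * gy, lr * gx))
        return offsets
    return predict


# Toy 8x8 map whose response peaks at row 5, column 2.
fmap = [[-((r - 5) ** 2 + (c - 2) ** 2) for c in range(8)] for r in range(8)]
tokens, points = progressive_sampling(fmap, n=2, num_iters=4,
                                      predict_offsets=make_toy_predictor(fmap))
```

In this toy setting the four sampling points drift from the regular grid toward the peak of the map. This mirrors, in a very simplified way, how the sampling locations are meant to concentrate on discriminative regions rather than stay on a fixed grid.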

updated: Tue Aug 03 2021 18:04:31 GMT+0000 (UTC)

published: Tue Aug 03 2021 18:04:31 GMT+0000 (UTC)

arXiv
