ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang; Yubin Yang

This paper presents an efficient multi-scale vision Transformer, called ResT, that can serve as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
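
To make points (1) and (3) of the abstract concrete, the following is a minimal shape-arithmetic sketch, not the authors' implementation. It assumes that the depth-wise convolution in the attention block is strided, so that it shrinks the key/value token map while queries keep full resolution. The function names, the 56x56 token map, the reduction factor of 8, and the kernel and padding choices are all hypothetical values picked for illustration; none of them come from the abstract.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_score_entries(height: int, width: int, heads: int,
                            reduction: int = 1) -> int:
    """Count entries in the attention score tensor for one image.

    Queries keep all height * width tokens. Keys/values are ASSUMED to be
    downsampled by a strided depth-wise convolution with stride = reduction.
    Kernel = reduction + 1 and padding = reduction // 2 are illustrative
    choices, not values taken from the abstract.
    """
    n_query = height * width
    if reduction > 1:
        key_h = conv_out_size(height, reduction + 1, reduction, reduction // 2)
        key_w = conv_out_size(width, reduction + 1, reduction, reduction // 2)
    else:
        key_h, key_w = height, width
    return heads * n_query * key_h * key_w


if __name__ == "__main__":
    # Hypothetical 56x56 token map with a single attention head.
    full = attention_score_entries(56, 56, heads=1)
    compressed = attention_score_entries(56, 56, heads=1, reduction=8)
    print(full, compressed, full // compressed)  # 9834496 153664 64

    # Overlapping patch embedding (kernel 3, stride 2, padding 1) versus
    # non-overlapping tokenization (kernel 2, stride 2, padding 0): both
    # halve the resolution, but only the first has overlapping windows.
    print(conv_out_size(56, 3, 2, 1), conv_out_size(56, 2, 2, 0))  # 28 28
```

Under these assumptions the score tensor shrinks by roughly the square of the reduction factor, which is the sense in which a strided depth-wise convolution can "compress the memory". The second comparison shows that an overlapping strided convolution downsamples the token map just as plain patch tokenization does, while letting neighboring patches share pixels.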

updated: Fri Jul 09 2021 08:12:19 GMT+0000 (UTC)

published: Fri May 28 2021 08:53:54 GMT+0000 (UTC)

arXiv
