PVT v2: Improved Baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
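
As a rough illustration of two of the ideas named above, the following sketch counts query-key pairs for an attention layer and computes the output size of a strided convolution. It is not taken from the PVT repository. The function names, the pooled size of 7, and the kernel/stride/padding values are assumptions chosen for illustration; the abstract does not state them.

```python
def attention_pairs(height, width, pooled_size=None):
    """Count the query-key pairs scored by one attention layer.

    Without pooling, every token attends to every token, so the count is
    quadratic in the number of tokens. If keys/values are first pooled to a
    fixed pooled_size x pooled_size grid (an assumed way to get a linear
    complexity attention layer), the count grows linearly with the tokens.
    """
    num_queries = height * width
    if pooled_size is None:
        num_keys = num_queries
    else:
        num_keys = pooled_size * pooled_size
    return num_queries * num_keys


def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    # Doubling the feature-map side quadruples the token count.
    for side in (56, 112):
        full = attention_pairs(side, side)
        pooled = attention_pairs(side, side, pooled_size=7)  # 7 is assumed
        print(f"{side}x{side}: full={full}, pooled={pooled}")

    # Overlapping patches: kernel larger than stride (assumed 7/4/3),
    # compared with non-overlapping patches (kernel equals stride).
    print(conv_output_size(224, kernel=7, stride=4, padding=3))
    print(conv_output_size(224, kernel=4, stride=4, padding=0))
```

With a fixed pooled grid, quadrupling the number of tokens quadruples the pair count, whereas full attention grows sixteenfold. Both patch-embedding settings yield the same output resolution; the overlapping variant differs only in that neighbouring windows share pixels.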

updated: Thu Jun 30 2022 15:31:56 GMT+0000 (UTC)

published: Fri Jun 25 2021 17:51:09 GMT+0000 (UTC)

arXiv
