PVTv2: Improved Baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVTv1) with three designs: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear-complexity attention layers. With these modifications, PVTv2 significantly improves on PVTv1 across three tasks: classification, detection, and segmentation. Moreover, PVTv2 achieves performance comparable to or better than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
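
To make designs (1) and (3) a little more concrete, the following is a minimal arithmetic sketch and not the authors' implementation. It assumes details the abstract does not state: that overlapping patch embedding is a strided convolution with kernel size 2S-1, stride S, and zero padding S-1, and that the linear-complexity attention layer pools keys and values to a fixed 7x7 grid. The function names are hypothetical. See the repository linked above for the actual code.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def overlapping_patch_grid(size: int, stride: int) -> int:
    """Side length of the token grid after an overlapping patch embedding.

    Assumption (not from the abstract): kernel = 2*stride - 1 and
    padding = stride - 1, so neighbouring windows overlap while the
    output grid still has ceil(size / stride) positions per axis.
    """
    return conv_output_size(
        size, kernel=2 * stride - 1, stride=stride, padding=stride - 1
    )


def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries computed per attention head."""
    return num_queries * num_keys


if __name__ == "__main__":
    side = overlapping_patch_grid(224, stride=4)  # 224x224 input, stride 4
    tokens = side * side

    # Full self-attention: every token attends to every token.
    full = attention_pairs(tokens, tokens)

    # Assumed linear variant: keys/values pooled to a fixed 7x7 grid,
    # so the cost grows linearly with the number of query tokens.
    pooled = attention_pairs(tokens, 7 * 7)

    print(f"grid side: {side}, tokens: {tokens}")
    print(f"full attention pairs:   {full}")
    print(f"pooled attention pairs: {pooled}")
```

Under these assumptions the key/value count stays fixed as the input resolution grows, which is what makes the attention cost linear rather than quadratic in the number of tokens.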

updated: Sat Jul 17 2021 15:12:25 GMT+0000 (UTC)

published: Fri Jun 25 2021 17:51:09 GMT+0000 (UTC)

arXiv
