PVTv2: Improved Baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

Transformers have recently shown encouraging progress in computer vision. In this work, we improve the original Pyramid Vision Transformer (PVTv1) with three design changes: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear-complexity attention layers. With these simple modifications, PVTv2 significantly improves on PVTv1 in classification, detection, and segmentation. Moreover, PVTv2 achieves better performance than recent works, including Swin Transformer. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT.
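
As a rough illustration of the first design, overlapping patch embedding, the sketch below compares the token grid produced by a non-overlapping and an overlapping strided convolution. The abstract does not give the hyperparameters, so the kernel size 2S - 1 and padding S - 1 for stride S are assumptions made here for illustration, and the helper names `conv_output_size` and `patch_grid` are hypothetical. Only the standard convolution output-size formula is used.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_grid(size: int, stride: int, overlapping: bool) -> int:
    """Number of patches along one axis of a size x size input.

    Assumed settings (not stated in the abstract):
      non-overlapping: kernel = stride,         padding = 0
      overlapping:     kernel = 2 * stride - 1, padding = stride - 1
    """
    if overlapping:
        kernel, padding = 2 * stride - 1, stride - 1
    else:
        kernel, padding = stride, 0
    return conv_output_size(size, kernel, stride, padding)


if __name__ == "__main__":
    for overlapping in (False, True):
        n = patch_grid(224, stride=4, overlapping=overlapping)
        print(f"overlapping={overlapping}: {n} x {n} patches")
```

Under these assumed settings both variants yield the same number of patches per axis, but with a stride of 4 the overlapping kernel spans 7 pixels, so adjacent patches share 3 pixels along each axis instead of being disjoint.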

updated: Mon Jun 28 2021 15:07:07 GMT+0000 (UTC)

published: Fri Jun 25 2021 17:51:09 GMT+0000 (UTC)

arXiv
