PVTv2: Improved Baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

Transformers in computer vision have recently shown encouraging progress. In this work, we improve the original Pyramid Vision Transformer (PVTv1) with three design changes: (1) locally continuous features with convolutions, (2) position encodings with zero paddings, and (3) linear-complexity attention layers with average pooling. With these simple modifications, PVTv2 significantly improves on PVTv1 in classification, detection, and segmentation. Moreover, PVTv2 achieves much better performance than recent works, including Swin Transformer, under ImageNet-1K pre-training. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT.
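
To illustrate why average pooling can make an attention layer linear in the number of tokens, here is a minimal sketch in plain Python. It is not the code from the repository above. The function names and the fixed pooled grid size of 7 are assumptions made for illustration. The idea shown is that if keys and values are average-pooled to a fixed grid, the number of query-key pairs grows with the token count rather than with its square.

```python
def average_pool_to_grid(grid, out_size):
    """Average-pool a 2D list of numbers down to an out_size x out_size grid.

    Hypothetical helper: assumes height and width are divisible by out_size.
    """
    height, width = len(grid), len(grid[0])
    if height % out_size or width % out_size:
        raise ValueError("grid dimensions must be divisible by out_size")
    block_h, block_w = height // out_size, width // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Collect every value in the (i, j) block and take its mean.
            block = [
                grid[y][x]
                for y in range(i * block_h, (i + 1) * block_h)
                for x in range(j * block_w, (j + 1) * block_w)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def full_attention_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def pooled_attention_pairs(height, width, pooled_size=7):
    """Query-key pairs when keys/values are pooled to a fixed grid.

    pooled_size=7 is an assumed value, not taken from the abstract.
    """
    tokens = height * width
    return tokens * pooled_size * pooled_size


if __name__ == "__main__":
    for side in (56, 112, 224):
        print(side, full_attention_pairs(side, side),
              pooled_attention_pairs(side, side))
```

Doubling the side length of the feature map multiplies the full-attention pair count by 16 but the pooled count by only 4, which is the sense in which the pooled variant scales linearly with the number of tokens.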

updated: Fri Jun 25 2021 17:51:09 GMT+0000 (UTC)

published: Fri Jun 25 2021 17:51:09 GMT+0000 (UTC)

arXiv

