P2T: Pyramid Pooling Transformer for Scene Understanding

Yu-Huan Wu; Yun Liu; Xin Zhan; Ming-Ming Cheng

Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity in the sequence length). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. By plugging in our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
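The sequence-length argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the P2T code: the 56 x 56 feature-map size, the pooling ratios, the ceil-mode rounding, and all function names are assumptions chosen for the example. It counts how many entries the attention score matrix has when the keys and values come from the full token sequence, from a single pooled map, or from several pooled maps concatenated (pyramid pooling).

```python
import math


def pooled_length(height, width, ratio):
    """Tokens left after pooling an H x W map by `ratio` (ceil rounding assumed)."""
    return math.ceil(height / ratio) * math.ceil(width / ratio)


def pyramid_length(height, width, ratios):
    """Key/value length when the pooled maps for all ratios are concatenated."""
    return sum(pooled_length(height, width, r) for r in ratios)


def attention_score_entries(num_queries, num_keys):
    """Size of the query-key score matrix, the term that grows quadratically."""
    return num_queries * num_keys


if __name__ == "__main__":
    height, width = 56, 56            # assumed feature-map size
    single_ratio = 8                  # assumed ratio for single pooling
    pyramid_ratios = (12, 16, 20, 24) # assumed ratios for pyramid pooling

    num_tokens = height * width
    full = attention_score_entries(num_tokens, num_tokens)
    single = attention_score_entries(
        num_tokens, pooled_length(height, width, single_ratio))
    pyramid = attention_score_entries(
        num_tokens, pyramid_length(height, width, pyramid_ratios))

    print("full self-attention entries:", full)
    print("single-pooling entries:     ", single)
    print("pyramid-pooling entries:    ", pyramid)
```

Under these assumed settings, the pyramid variant keeps the key/value sequence about as short as single pooling while drawing on several pooling scales, which is the trade-off the abstract describes.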

updated: Fri Aug 05 2022 07:54:44 GMT+0000 (UTC)

published: Tue Jun 22 2021 18:28:52 GMT+0000 (UTC)

arXiv
