P2T: Pyramid Pooling Transformer for Scene Understanding

Yu-Huan Wu; Yun Liu; Xin Zhan; Ming-Ming Cheng

This paper jointly addresses two problems in vision transformers: i) Multi-Head Self-Attention (MHSA) has high computational and space complexity; ii) recent vision transformer networks are overly tuned for image classification, ignoring the difference between image classification (simple scenarios, more similar to NLP) and downstream scene understanding tasks (complicated scenarios with rich structural and contextual information). To this end, we note that pyramid pooling has been shown to be effective in various vision tasks owing to its strong ability in context abstraction, and that its natural property of spatial invariance makes it well suited to addressing the loss of structural information (problem ii). Hence, we propose to adapt pyramid pooling to MHSA to alleviate its high demand for computational resources (problem i). This pooling-based MHSA addresses both problems and is thus flexible and powerful for downstream scene understanding tasks. Using our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied as the backbone network, P2T shows substantial superiority over previous CNN- and transformer-based networks in various downstream scene understanding tasks such as semantic segmentation, object detection, instance segmentation, and visual saliency detection. The code will be released at https://github.com/yuhuan-wu/P2T.
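The idea of pooling-based attention can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single head, no learned projections, a toy 8x8 grid of 2-dimensional features, and hypothetical pooling ratios (2, 4, 8). All function names are made up for illustration. It only shows how attending to pyramid-pooled keys and values shrinks the attention matrix.

```python
import math

def avg_pool(grid, ratio):
    """Average-pool an H x W grid of feature vectors with a ratio x ratio window."""
    h, w, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for top in range(0, h, ratio):
        for left in range(0, w, ratio):
            cells = [grid[y][x]
                     for y in range(top, min(top + ratio, h))
                     for x in range(left, min(left + ratio, w))]
            pooled.append([sum(c[k] for c in cells) / len(cells) for k in range(dim)])
    return pooled

def pyramid_pool(grid, ratios):
    """Concatenate the pooled tokens from every pyramid level into one short sequence."""
    tokens = []
    for ratio in ratios:
        tokens.extend(avg_pool(grid, ratio))
    return tokens

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_attention(grid, ratios):
    """Queries are all tokens; keys and values are the pyramid-pooled tokens.

    Assumption: no learned Q/K/V projections and a single head, to keep the sketch short.
    """
    queries = [token for row in grid for token in row]
    keys_values = pyramid_pool(grid, ratios)
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * v[k] for w, v in zip(weights, keys_values))
                        for k in range(dim)])
    return outputs, len(keys_values)

# Toy input: an 8x8 grid whose features are the (row, column) coordinates.
H = W = 8
grid = [[[float(y), float(x)] for x in range(W)] for y in range(H)]
out, n_kv = pooled_attention(grid, ratios=(2, 4, 8))
print(len(out), n_kv)  # 64 query tokens attend to 16 + 4 + 1 = 21 pooled tokens
```

In this toy setting the attention matrix has 64 x 21 entries instead of the 64 x 64 entries of full self-attention, and the saving grows with the input resolution as long as the pooled sequence stays short.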

updated: Sat Jul 10 2021 16:22:53 GMT+0000 (UTC)

published: Tue Jun 22 2021 18:28:52 GMT+0000 (UTC)

arXiv
