InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

Hanrong Ye; Dan Xu

Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning over a series of correlated tasks with pixel-wise prediction. Most existing works rely heavily on convolution operations and are therefore severely limited to local modeling, whereas learning interactions and inference in a global spatial-position and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work that explores designing a transformer structure for multi-task dense prediction for scene understanding. Moreover, it has been widely demonstrated that a higher spatial resolution is remarkably beneficial for dense prediction, yet it is very challenging for existing transformers to go deeper at higher resolutions because their complexity grows rapidly with spatial size. InvPT presents an efficient UP-Transformer block that learns multi-task feature interaction at gradually increasing resolutions and incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at a high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods. The code is available at https://github.com/prismformore/InvPT
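
To make the resolution concern in the abstract concrete, the following minimal sketch counts the entries of a full (global) self-attention map over a square feature map. It illustrates standard self-attention only and is not the InvPT code; the function name and the feature-map sizes (14, 28, 56) are assumptions chosen for illustration.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Entries in a full self-attention map over height * width tokens.

    Every token attends to every other token, so the map has
    (height * width) ** 2 entries per attention head.
    """
    tokens = height * width
    return tokens * tokens


if __name__ == "__main__":
    # Hypothetical feature-map side lengths, not taken from the paper.
    for side in (14, 28, 56):
        print(side, attention_matrix_entries(side, side))
```

Doubling the side length quadruples the number of tokens and multiplies the attention map size by 16, which is why applying global self-attention directly at high resolution becomes expensive.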

updated: Mon Nov 07 2022 02:00:02 GMT+0000 (UTC)

published: Tue Mar 15 2022 15:29:08 GMT+0000 (UTC)

arXiv
