Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

Hanrong Ye; Dan Xu

Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning on a series of correlated tasks with pixel-wise prediction. Most existing works are severely limited to local modeling because they rely heavily on convolution operations, whereas learning interactions and inference in a global spatial-position and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task (InvPT) Transformer that models spatial positions and multiple tasks simultaneously in a unified framework. To the best of our knowledge, this is the first work to explore the design of a transformer structure for multi-task dense prediction in scene understanding. Moreover, although higher spatial resolution has been widely shown to benefit dense prediction, it is very challenging for existing transformers to go deeper at higher resolutions because their complexity grows sharply with spatial size. InvPT presents an efficient UP-Transformer block that learns multi-task feature interaction at gradually increasing resolutions and incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods. Code and trained models will be publicly available.
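
The resolution bottleneck mentioned in the abstract can be illustrated with a rough count of attention-map entries. The sketch below is not the InvPT implementation: it assumes plain global self-attention with one token per spatial position, and the feature-map sizes and the helper name `attention_matrix_entries` are hypothetical, chosen only for illustration.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Entries in one global self-attention map over an H x W feature map."""
    tokens = height * width  # assumption: one token per spatial position
    return tokens * tokens   # every token attends to every token


# Hypothetical feature-map sizes; each step doubles the spatial resolution.
sizes = [(14, 14), (28, 28), (56, 56)]
entries = [attention_matrix_entries(h, w) for h, w in sizes]

for (h, w), n in zip(sizes, entries):
    print(f"{h}x{w}: {n:,} attention entries per head")
```

Under these assumptions, each doubling of the resolution multiplies the attention map by 16, which is why stacking deep attention layers at high resolution quickly becomes expensive.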

updated: Tue Mar 15 2022 15:29:08 GMT+0000 (UTC)

published: Tue Mar 15 2022 15:29:08 GMT+0000 (UTC)

arXiv
