InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

Hanrong Ye; Dan Xu

Multi-task scene understanding aims to build a single versatile model that predicts several scene understanding tasks simultaneously. Previous studies typically process multi-task features locally and therefore cannot effectively learn spatially global and cross-task interactions, which hampers the models' ability to exploit the consistency among tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer (InvPT++) that models cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first use a transformer encoder to capture task-generic features for all tasks. We then design a transformer decoder that establishes spatial and cross-task interaction globally, and devise a novel UP-Transformer block that gradually increases the resolution of the multi-task features and establishes cross-task interaction at different scales. Furthermore, we propose two types of Cross-Scale Self-Attention modules, Fusion Attention and Selective Attention, to facilitate cross-task interaction across feature scales efficiently. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks demonstrate the effectiveness of our proposal and establish state-of-the-art performance.
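To make the idea of spatially global, cross-task interaction concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each task contributes a grid of feature tokens, that the tokens of all tasks are concatenated into one sequence, and that a plain single-head self-attention is applied to that sequence, so every spatial position of every task can attend to every position of every other task. The function name `cross_task_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical choices for illustration; the UP-Transformer block, Fusion Attention, and Selective Attention named in the abstract are not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_task_self_attention(task_feats, w_q, w_k, w_v):
    """Single-head self-attention over the tokens of all tasks jointly.

    task_feats: array of shape (T, N, C) with T tasks, N spatial tokens
                per task, and C channels (illustrative layout, an assumption).
    Returns the attended features with shape (T, N, C) and the attention
    map with shape (T*N, T*N).
    """
    num_tasks, num_tokens, channels = task_feats.shape

    # Concatenate the tokens of every task into one sequence of length T*N,
    # so attention is global in space and across tasks at the same time.
    tokens = task_feats.reshape(num_tasks * num_tokens, channels)

    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v

    # Scaled dot-product attention; each row is a distribution over all
    # tokens of all tasks.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = attn @ v

    return out.reshape(num_tasks, num_tokens, channels), attn


# Toy usage: 3 tasks, a 4x4 feature map per task, 8 channels.
rng = np.random.default_rng(0)
T, H, W, C = 3, 4, 4, 8
feats = rng.standard_normal((T, H * W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
out, attn = cross_task_self_attention(feats, w_q, w_k, w_v)
```

In this sketch the attention map grows quadratically with the number of tasks times the number of spatial tokens, which illustrates why applying global attention directly at high resolution is expensive and why a decoder might raise the feature resolution gradually instead.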

updated: Thu Jun 08 2023 00:28:22 GMT+0000 (UTC)

published: Thu Jun 08 2023 00:28:22 GMT+0000 (UTC)

arXiv
