Dynamic Focus-aware Positional Queries for Semantic Segmentation

Haoyu He; Jianfei Cai; Zizheng Pan; Jing Liu; Jing Zhang; Dacheng Tao; Bohan Zhuang

DETR-like segmentors, which train a set of queries representing class prototypes or target segments end-to-end, have underpinned the most recent breakthroughs in semantic segmentation. Masked attention was recently proposed to ease optimization by restricting each query to attend only to the foreground regions predicted by the preceding decoder block. Although promising, it relies on learnable parameterized positional queries, which tend to encode dataset statistics and therefore localize distinct individual queries inaccurately. In this paper, we propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries (DFPQ), which dynamically generates positional queries conditioned simultaneously on the cross-attention scores from the preceding decoder block and the positional encodings of the corresponding image features. As a result, DFPQ preserves rich localization information for the target segments and provides accurate, fine-grained positional priors. In addition, we propose to handle high-resolution cross-attention efficiently by aggregating only the contextual tokens based on the low-resolution cross-attention scores, which performs local relation aggregation. Extensive experiments on ADE20K and Cityscapes show that, with these two modifications to Mask2Former, our framework achieves state-of-the-art (SOTA) performance and outperforms Mask2Former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones, respectively, on the ADE20K validation set. Source code is at https://github.com/zip-group/FASeg.
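The core idea of DFPQ, as the abstract describes it, can be illustrated with a small sketch. It assumes (this is not confirmed by the abstract) that a positional query is formed as the attention-weighted sum of the positional encodings of the image features, using the cross-attention scores from the preceding decoder block as weights. The function names `masked_cross_attention` and `dynamic_positional_query` are hypothetical and are not taken from the FASeg repository; the code is a toy illustration in pure Python, not the authors' implementation.

```python
import math


def masked_cross_attention(query, keys, mask):
    """Attention weights of one query over all keys (scaled dot product).

    Positions where mask is False receive zero weight, mimicking masked
    attention that only looks at the predicted foreground region.
    """
    if not any(mask):
        raise ValueError("at least one position must be unmasked")
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale if keep else float("-inf")
        for key, keep in zip(keys, mask)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_positional_query(attn, pos_enc):
    """Assumed DFPQ step: weight each positional encoding by its attention score."""
    dim = len(pos_enc[0])
    return [sum(a * p[j] for a, p in zip(attn, pos_enc)) for j in range(dim)]


if __name__ == "__main__":
    # Toy 2x2 feature map flattened to 4 positions; the (row, col) coordinates
    # stand in for real positional encodings.
    pos_enc = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    keys = [[0.0, 1.0], [0.0, 1.0], [2.0, 0.0], [2.0, 0.0]]
    query = [1.0, 0.0]
    mask = [False, True, True, True]  # foreground predicted by the previous block

    attn = masked_cross_attention(query, keys, mask)
    print("attention:", attn)
    print("positional query:", dynamic_positional_query(attn, pos_enc))
```

Under this assumption, the positional query moves toward the locations the query content currently attends to (here, the bottom row of the toy feature map), rather than being a fixed learned vector shared across all images.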

updated: Mon Nov 21 2022 02:56:14 GMT+0000 (UTC)

published: Mon Apr 04 2022 05:16:41 GMT+0000 (UTC)

arXiv
