Lawin Transformer: Improving New-Era Vision Backbones with Multi-Scale Representations for Semantic Segmentation

Haotian Yan; Chuang Zhang; Ming Wu

The multi-level aggregation (MLA) module has emerged as a critical component for advancing new-era vision backbones in semantic segmentation. In this paper, we propose Lawin (large window) Transformer, a novel MLA architecture that makes use of multi-scale feature maps from the vision backbone. At the core of Lawin Transformer is Lawin attention, a newly designed window attention mechanism in which a local query window attends to a much larger context window. We focus on an efficient and simple application of the large-window paradigm, one that allows the ratio of context size to query size to be adjusted flexibly and thereby captures multi-scale representations. We validate the effectiveness of Lawin Transformer on Cityscapes and ADE20K, where it consistently outperforms widely used MLA modules when combined with new-era vision backbones. The code is available at https://github.com/yan-hao-tian/lawin.
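The large-window idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a single attention head, no learned projections, zero padding at the borders, and average pooling to bring the context down to the query's size; the abstract specifies none of these details. The function names and the parameters `p` (query window side) and `r` (context-to-query ratio) are hypothetical.

```python
import math


def crop(feat, top, left, size):
    """Crop a size x size patch from an H x W x C map (nested lists).

    Positions outside the map are zero-padded (an assumption of this sketch).
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    zero = [0.0] * c
    return [[feat[y][x] if 0 <= y < h and 0 <= x < w else zero
             for x in range(left, left + size)]
            for y in range(top, top + size)]


def avg_pool(patch, r):
    """Average-pool a square patch by an integer factor r."""
    size = len(patch) // r
    c = len(patch[0][0])
    return [[[sum(patch[y * r + dy][x * r + dx][k]
                  for dy in range(r) for dx in range(r)) / (r * r)
              for k in range(c)]
             for x in range(size)]
            for y in range(size)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def large_window_attention(feat, top, left, p, r):
    """Attend from a p x p query window to an (r*p) x (r*p) context window.

    The context window is placed around the query window and average-pooled
    back to p x p, so the number of keys stays p*p regardless of r.
    """
    c = len(feat[0][0])
    query = [tok for row in crop(feat, top, left, p) for tok in row]
    offset = (r * p - p) // 2  # margin added on the top and left sides
    context = crop(feat, top - offset, left - offset, r * p)
    keys = [tok for row in avg_pool(context, r) for tok in row]
    out = []
    for q in query:
        # Scaled dot-product scores between one query token and all keys.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in keys]
        weights = softmax(scores)
        # Keys double as values because this sketch has no projections.
        out.append([sum(w * k[ch] for w, k in zip(weights, keys))
                    for ch in range(c)])
    return out
```

Calling `large_window_attention(feat, top, left, p, r)` with several values of `r` on the same query window yields one output per context scale, which is one way to read the abstract's "multi-scale representations". How the actual model combines those outputs is not stated in the abstract and is left out here.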

updated: Thu Aug 03 2023 06:13:54 GMT+0000 (UTC)

published: Wed Jan 05 2022 13:51:20 GMT+0000 (UTC)

arXiv
