Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention

Gary Leung; Jun Gao; Xiaohui Zeng; Sanja Fidler

Existing transformer-based image backbones typically propagate feature information in one direction, from lower to higher levels. This may not be ideal, since the localization ability needed to delineate accurate object boundaries is most prominent in the lower-level, high-resolution feature maps, while the semantics that can disambiguate image signals belonging to one object versus another typically emerge at a higher level of processing. We present Hierarchical Inter-Level Attention (HILA), an attention-based method that captures bottom-up and top-down updates between features of different levels. HILA extends hierarchical vision transformer architectures by adding local connections between features of higher and lower levels to the backbone encoder. In each iteration, we construct a hierarchy by having higher-level features compete for assignments to update the lower-level features belonging to them, iteratively resolving object-part relationships. These improved lower-level features are then used to re-update the higher-level features. HILA can be integrated into the majority of hierarchical architectures without requiring any changes to the base model. We add HILA to SegFormer and the Swin Transformer and show notable improvements in semantic segmentation accuracy with fewer parameters and FLOPs. Project website and code: https://www.cs.toronto.edu/~garyleung/hila/
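The top-down and bottom-up updates described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`masked_softmax`, `inter_level_update`), the boolean `local` connectivity mask, the plain dot-product similarity without learned projections, and the residual updates are all assumptions chosen for brevity. It only shows one plausible reading of "higher-level features compete for assignments": a softmax over the higher-level features connected to each lower-level feature.

```python
import numpy as np


def masked_softmax(logits, mask, axis):
    """Softmax along `axis`, restricted to entries where `mask` is True."""
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=axis, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=axis, keepdims=True)


def inter_level_update(low, high, local):
    """One top-down then bottom-up update between two feature levels.

    low:   (N, d) lower-level (high-resolution) features
    high:  (M, d) higher-level (low-resolution) features
    local: (N, M) boolean mask, True where low feature i is locally
           connected to high feature j (hypothetical layout; every row
           and every column must contain at least one True)
    """
    d = low.shape[1]

    # Top-down: for each low feature, its connected high features compete
    # via a softmax over the high axis, then update that low feature.
    logits = low @ high.T / np.sqrt(d)           # (N, M)
    td = masked_softmax(logits, local, axis=1)   # each row sums to 1
    low = low + td @ high

    # Bottom-up: each high feature is re-updated from the improved low
    # features in its neighbourhood (softmax over the low axis).
    logits = low @ high.T / np.sqrt(d)           # (N, M)
    bu = masked_softmax(logits, local, axis=0)   # each column sums to 1
    high = high + bu.T @ low
    return low, high, td, bu


# Toy 1-D example: 8 low features, 4 high features, each low feature is
# connected to its parent (i // 2) and the parent's immediate neighbours.
rng = np.random.default_rng(0)
N, M, d = 8, 4, 16
low = rng.normal(size=(N, d))
high = rng.normal(size=(M, d))
parent = np.arange(N) // 2
local = np.abs(parent[:, None] - np.arange(M)[None, :]) <= 1
new_low, new_high, td, bu = inter_level_update(low, high, local)
```

In this sketch the mask plays the role of the "local connections" between levels; a real backbone would use 2-D windows over feature maps and learned query, key, and value projections, which are omitted here.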

updated: Tue Jul 05 2022 15:47:31 GMT+0000 (UTC)

published: Tue Jul 05 2022 15:47:31 GMT+0000 (UTC)

arXiv
