Head-Free Lightweight Semantic Segmentation with Linear Transformer

Bo Dong; Pichao Wang; Fan Wang

Existing semantic segmentation work has focused mainly on designing effective decoders; however, the computational load introduced by the overall structure has long been ignored, which hinders deployment on resource-constrained hardware. In this paper, we propose a head-free lightweight architecture specifically for semantic segmentation, named Adaptive Frequency Transformer. It adopts a parallel architecture that leverages prototype representations as specific learnable local descriptions, which replace the decoder and preserve the rich image semantics of high-resolution features. Although removing the decoder eliminates most of the computation, the accuracy of the parallel structure is still limited by low computational resources. Therefore, we employ heterogeneous operators (CNN and Vision Transformer) for pixel embedding and prototype representations to further reduce computational cost. Moreover, it is very difficult to linearize the complexity of the vision Transformer from the perspective of the spatial domain. Because semantic segmentation is very sensitive to frequency information, we construct a lightweight prototype learning block with an adaptive frequency filter of complexity O(n) to replace standard self-attention, whose complexity is O(n^2). Extensive experiments on widely adopted datasets demonstrate that our model achieves superior accuracy while retaining only 3M parameters. On the ADE20K dataset, our model achieves 41.8 mIoU at 4.6 GFLOPs, which is 4.4 mIoU higher than Segformer with 45% fewer GFLOPs. On the Cityscapes dataset, our model achieves 78.7 mIoU at 34.4 GFLOPs, which is 2.5 mIoU higher than Segformer with 72.5% fewer GFLOPs. Code is available at https://github.com/dongbo811/AFFormer.
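The comparison figures above can be unpacked with a little arithmetic. The sketch below backs out the Segformer baseline values that the reported gaps imply. The abstract does not state these baseline numbers or which Segformer variant is compared, so the outputs are derived estimates that inherit the rounding of the quoted percentages; the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(miou, gflops, miou_gain, gflops_reduction):
    """Back out baseline figures from a model's reported numbers.

    miou, gflops      -- values reported for the proposed model
    miou_gain         -- reported mIoU advantage over the baseline
    gflops_reduction  -- reported fractional GFLOPs saving (0.45 for 45%)
    """
    baseline_miou = miou - miou_gain
    # "45% fewer GFLOPs" means gflops = baseline * (1 - 0.45)
    baseline_gflops = gflops / (1.0 - gflops_reduction)
    return round(baseline_miou, 1), round(baseline_gflops, 1)


# Figures quoted in the abstract
ade20k = implied_baseline(41.8, 4.6, 4.4, 0.45)
cityscapes = implied_baseline(78.7, 34.4, 2.5, 0.725)

print("ADE20K implied baseline (mIoU, GFLOPs):", ade20k)
print("Cityscapes implied baseline (mIoU, GFLOPs):", cityscapes)
```

Under these assumptions the implied baseline is roughly 37.4 mIoU at 8.4 GFLOPs on ADE20K and 76.2 mIoU at 125.1 GFLOPs on Cityscapes.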

updated: Wed Jan 11 2023 18:59:46 GMT+0000 (UTC)

published: Wed Jan 11 2023 18:59:46 GMT+0000 (UTC)

arXiv
