UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Libo Wang; Rui Li; Ce Zhang; Shenghui Fang; Chenxi Duan; Xiaoliang Meng; Peter M. Atkinson

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.
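The abstract reports accuracy as mIoU and F1. As a rough illustration of how such scores are commonly derived from a pixel-level confusion matrix, here is a minimal pure-Python sketch. It is not taken from the GeoSeg repository; the function names (`confusion_matrix`, `per_class_iou_f1`, `mean`) and the toy labels are assumptions made for this example, and dataset-specific rules such as which classes are excluded from the average are not modelled.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou_f1(cm):
    """Return (IoU list, F1 list), skipping classes absent from both inputs."""
    n = len(cm)
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp     # predicted c, true class differs
        union = tp + fp + fn
        if union == 0:
            continue  # assumption: ignore classes that never occur
        ious.append(tp / union)                       # IoU = TP / (TP + FP + FN)
        f1s.append(2 * tp / (2 * tp + fp + fn))       # F1 = 2TP / (2TP + FP + FN)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


# Toy example: six "pixels" flattened into lists, three hypothetical classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
ious, f1s = per_class_iou_f1(confusion_matrix(y_true, y_pred, 3))
print("mIoU:", mean(ious))
print("mean F1:", mean(f1s))
```

In practice the confusion matrix would be accumulated over every pixel of the whole test set before the per-class scores are averaged, rather than averaged image by image.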

updated: Sun Jun 26 2022 14:15:18 GMT+0000 (UTC)

published: Sat Sep 18 2021 13:55:38 GMT+0000 (UTC)

arXiv
