UNetFormer: An UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Libo Wang; Rui Li; Ce Zhang; Shenghui Fang; Chenxi Duan; Xiaoliang Meng; Peter M. Atkinson

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNNs adopt hierarchical feature representations and show strong capabilities for local information extraction. However, the local nature of the convolution layer limits the network's ability to capture global context. Recently, the Transformer, a hot topic in computer vision, has demonstrated great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. UNetFormer adopts a hybrid structure with a CNN-based encoder and a Transformer-based decoder, learning global-local context with high computational efficiency. Extensive experiments reveal that the proposed UNetFormer not only runs faster at inference but also produces higher accuracy than state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved a 67.8% mIoU on the UAVid test set and a 52.4% mIoU on the LoveDA dataset, while reaching an inference speed of up to 322.4 FPS for 512x512 input on an NVIDIA GTX 3090 GPU. The source code will be freely available.
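The accuracy figures above are reported as mIoU (mean intersection over union). As a rough illustration of how such a score is commonly computed, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example, and the benchmarks' exact protocols (ignored classes, accumulation over the whole test set) are not described in the abstract.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes present in the prediction or the ground truth.

    `pred` and `target` are flattened per-pixel class labels (hypothetical
    input layout chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where both agree on class c
    pred_count = [0] * num_classes    # pixels predicted as class c
    target_count = [0] * num_classes  # pixels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious: List[float] = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Usage: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the usage line, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.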

updated: Wed Apr 13 2022 10:03:55 GMT+0000 (UTC)

published: Sat Sep 18 2021 13:55:38 GMT+0000 (UTC)

arXiv
