Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation

Libo Wang; Shenghui Fang; Ce Zhang; Rui Li; Chenxi Duan

Semantic segmentation of fine-resolution urban scene images plays a vital role in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection and economic assessment. Driven by rapid developments in deep learning, the convolutional neural network (CNN) has dominated the semantic segmentation task for many years. CNNs adopt a hierarchical feature representation and are strong at extracting local information. However, the local nature of the convolution layer limits the network's ability to capture the global context that is crucial for precise segmentation. Recently, the Transformer has become a hot topic in the computer vision domain. It demonstrates a strong capability for global information modelling and has boosted many vision tasks, such as image classification, object detection and especially semantic segmentation. In this paper, we propose an efficient hybrid Transformer (EHT) for real-time urban scene segmentation. The EHT adopts a hybrid structure with a CNN-based encoder and a Transformer-based decoder, learning global-local context at lower computational cost. Extensive experiments demonstrate that our EHT achieves faster inference with competitive accuracy compared with state-of-the-art lightweight models. Specifically, the proposed EHT achieves 66.9% mIoU on the UAVid test set, significantly outperforming other benchmark networks. The code will be available soon.
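For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-pixel class labels. It is not the authors' evaluation code: the function name `mean_iou`, the flat-list inputs and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. Benchmark protocols usually accumulate the counts over the whole test set before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union for two equal-length label sequences.

    Hypothetical helper for illustration; classes that appear in neither
    sequence are skipped rather than counted as zero.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    # Count per-class pixels in the prediction, the ground truth, and both.
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        # |A union B| = |A| + |B| - |A intersect B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. A reported figure such as 66.9% is the same quantity expressed as a percentage.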

updated: Wed Oct 13 2021 13:45:53 GMT+0000 (UTC)

published: Sat Sep 18 2021 13:55:38 GMT+0000 (UTC)

arXiv
