Building extraction with vision transformer

Libo Wang; Shenghui Fang; Rui Li; Xiaoliang Meng

Buildings are an important carrier of human productive activities, so building extraction is essential for urban dynamic monitoring and necessary for suburban construction inspection. Accurate building extraction from remote sensing images remains a challenge because of the complex backgrounds and diverse appearances of buildings. Building extraction methods based on convolutional neural networks (CNNs) have increased accuracy significantly, but they are criticized for their inability to model global dependencies. This paper therefore applies the Vision Transformer to building extraction. In practice, however, the Vision Transformer comes with two limitations. First, it requires more GPU memory and computation than CNNs, and this cost is magnified for large inputs such as fine-resolution remote sensing images. Second, spatial details are not sufficiently preserved during its feature extraction, which prevents fine-grained building segmentation. To handle these issues, we propose BuildFormer, a novel Vision Transformer with a dual-path structure. Specifically, we design a spatial-detailed context path to encode rich spatial details and a global context path to capture global dependencies. In addition, we develop a window-based linear multi-head self-attention whose complexity is linear in the window size. This allows large windows to be used, which strengthens global context extraction and greatly improves the potential of the Vision Transformer for processing large remote sensing images. The proposed method yields state-of-the-art performance (75.74% IoU) on the Massachusetts building dataset. Code will be available.
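
The abstract does not spell out how the window-based linear attention is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a kernel feature map (elu(x) + 1, a common choice in linear-attention work that the paper may not use), a single head, and no learned projections. The function names and the 16 x 16 x 8 toy feature map are hypothetical. The point it illustrates is that reordering the matrix products avoids building the N x N similarity matrix inside each window, so the cost grows linearly with the number of tokens N per window.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map (assumption: the paper's kernel may differ)
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    # Reference form: builds the full (N, N) similarity matrix, O(N^2 * d)
    fq, fk = feature_map(q), feature_map(k)
    scores = fq @ fk.T
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention(q, k, v):
    # Same result via associativity: fq @ (fk.T @ v), O(N * d^2)
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v                # (d, d_v) summary of keys and values
    z = fk.sum(axis=0)           # (d,) normaliser
    return (fq @ kv) / (fq @ z)[:, None]

def window_partition(x, w):
    # x: (H, W, C) with H and W divisible by w -> (num_windows, w*w, C)
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 8))      # toy feature map (hypothetical size)
windows = window_partition(feat, 8)          # 4 windows of 64 tokens each

# Self-attention inside each window, with tokens used as q, k and v directly
out_lin = np.stack([linear_attention(t, t, t) for t in windows])
out_quad = np.stack([quadratic_attention(t, t, t) for t in windows])
```

Both functions return the same values up to floating-point error. Only the linear form avoids the N x N intermediate, which is what makes larger windows affordable in this kind of scheme.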

updated: Wed Apr 13 2022 09:34:42 GMT+0000 (UTC)

published: Mon Nov 29 2021 11:23:52 GMT+0000 (UTC)

arXiv
