CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding

Dilxat Muhtar; Xueliang Zhang; Pengfeng Xiao; Zhenshi Li; Feng Gu

Self-supervised learning (SSL) has gained widespread attention in the remote sensing (RS) and earth observation (EO) communities owing to its ability to learn task-agnostic representations without human-annotated labels. Nevertheless, most existing RS SSL methods are limited to learning either globally semantically separable representations or locally spatially perceptible ones, but not both. We argue that this learning strategy is suboptimal for RS, since the representations required by different RS downstream tasks are often varied and complex. In this study, we propose a unified SSL framework that is better suited to RS image representation learning. The proposed framework, Contrastive Mask Image Distillation (CMID), learns representations with both global semantic separability and local spatial perceptibility by combining contrastive learning (CL) with masked image modeling (MIM) in a self-distillation manner. Furthermore, CMID is architecture-agnostic and compatible with both convolutional neural networks (CNN) and vision transformers (ViT), so it can be easily adapted to a variety of deep learning (DL) applications for RS understanding. Comprehensive experiments were carried out on four downstream tasks (scene classification, semantic segmentation, object detection, and change detection), and the results show that models pre-trained with CMID achieve better performance than other state-of-the-art SSL methods on multiple downstream tasks. The code and pre-trained models will be made available at https://github.com/NJU-LHRS/official-CMID to facilitate SSL research and speed up the development of DL applications for RS images.
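The abstract does not give the loss functions, so the following is only a minimal sketch of the general idea of pairing a contrastive term with a masked-reconstruction term under a momentum (self-distillation) teacher. It is not the authors' implementation. Every name here (`info_nce`, `masked_mse`, `ema_update`, `combined_loss`), the InfoNCE form, the MSE reconstruction target, the temperature, the momentum, and the loss weight are assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(student, teacher, temperature=0.2):
    """Assumed contrastive term: student[i] should match teacher[i]
    and be dissimilar to every other teacher embedding in the batch."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, sv in enumerate(s):
        logits = [sum(a * b for a, b in zip(sv, tv)) / temperature for tv in t]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the positive at index i
    return total / len(s)

def masked_mse(pred, target, mask):
    """Assumed reconstruction term: mean squared error over masked patches only."""
    errs = [(p - q) ** 2 for p, q, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

def ema_update(teacher_w, student_w, momentum=0.99):
    """Momentum update of teacher weights from student weights."""
    return [momentum * tw + (1 - momentum) * sw
            for tw, sw in zip(teacher_w, student_w)]

def combined_loss(student_emb, teacher_emb, pred, target, mask, weight=1.0):
    """Global (contrastive) term plus a weighted local (masked) term."""
    return info_nce(student_emb, teacher_emb) + weight * masked_mse(pred, target, mask)
```

In this sketch the contrastive term acts on one embedding per image (a global signal), while the reconstruction term acts only on masked patches (a local signal). That split is one plausible reading of how global separability and local perceptibility could be trained together; the actual objectives are defined in the paper and its repository.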

updated: Wed Apr 19 2023 13:58:31 GMT+0000 (UTC)

published: Wed Apr 19 2023 13:58:31 GMT+0000 (UTC)

arXiv
