Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification

Sen Jia; Yifan Wang

Hyperspectral images (HSI) offer a broad macroscopic field of view and rich spectral information. Identifying the types of surface objects from that spectral information is one of the main applications of hyperspectral imaging research. In recent years, a growing number of deep learning methods have been proposed, among which convolutional neural networks (CNNs) are the most influential. However, CNN-based methods struggle to capture long-range dependencies and require a large amount of labeled data for training. In addition, most self-supervised training methods for HSI classification are based on reconstructing the input samples, which makes it difficult to use unlabeled samples effectively. To address the shortcomings of CNNs, we propose a novel multiscale convolutional embedding module for HSI that extracts spatial-spectral information effectively and combines better with a Transformer network. To make more efficient use of unlabeled data, we propose a new self-supervised pretext task. It is similar to the masked autoencoder, but our pretraining method masks only the token corresponding to the central pixel in the encoder and feeds the remaining tokens into the decoder to reconstruct the spectral information of the central pixel. Such a pretext task better models the relationship between the central feature and the neighborhood features, and yields more stable training results.
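The center-mask pretext task described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one token per pixel of a square, odd-sized patch, and it assumes a mean squared error objective, which the abstract does not specify. The function names `center_mask_split` and `reconstruction_loss` are hypothetical, and the encoder and decoder networks themselves are left out.

```python
def center_mask_split(patch):
    """Split a square HSI patch into visible tokens and the masked center target.

    patch: list of rows; each row is a list of pixels; each pixel is a list
    of band values (its spectrum). Assumes one token per pixel.
    Returns (visible_tokens, visible_positions, center_spectrum).
    """
    size = len(patch)
    if size % 2 == 0 or any(len(row) != size for row in patch):
        raise ValueError("patch must be square with an odd side length")
    center = size // 2
    visible_tokens, visible_positions = [], []
    for r, row in enumerate(patch):
        for c, pixel in enumerate(row):
            if r == center and c == center:
                continue  # only the center pixel's token is masked
            visible_tokens.append(list(pixel))
            visible_positions.append(r * size + c)  # flattened position index
    center_spectrum = list(patch[center][center])  # reconstruction target
    return visible_tokens, visible_positions, center_spectrum


def reconstruction_loss(predicted, target):
    """Mean squared error over spectral bands (an assumed objective)."""
    if len(predicted) != len(target):
        raise ValueError("spectra must have the same number of bands")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy 3x3 patch with 4 bands per pixel; each pixel's bands equal its index.
patch = [[[float(r * 3 + c)] * 4 for c in range(3)] for r in range(3)]
visible, positions, target = center_mask_split(patch)
print(len(visible), positions, target)
```

In a full pipeline, the visible tokens and their positions would feed the model, and the loss would compare the model's predicted center spectrum against `target`.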

updated: Mon Mar 21 2022 03:02:37 GMT+0000 (UTC)

published: Wed Mar 09 2022 14:42:26 GMT+0000 (UTC)

arXiv
