Self-supervised Audiovisual Representation Learning for Remote Sensing Data

Konrad Heidler; Lichao Mou; Di Hu; Pu Jin; Guangyao Li; Chuang Gan; Ji-Rong Wen; Xiao Xiang Zhu

Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparably large annotated datasets and the wide diversity of sensing platforms impede similar developments. To contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks. By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation. For this purpose, we introduce the SoundingEarth dataset, which consists of co-located aerial imagery and audio samples from all around the world. Using this dataset, we then pre-train ResNet models to map samples from both modalities into a common embedding space, which encourages the models to understand key properties of a scene that influence both visual and auditory appearance. To validate the usefulness of the proposed approach, we evaluate the transfer learning performance of the resulting pre-trained weights against weights obtained through other means. By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery. The dataset, code and pre-trained model weights will be available at https://github.com/khdlr/SoundingEarth.
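
The idea of mapping both modalities into a common embedding space can be illustrated with a small sketch. The abstract does not state which training objective is used, so the symmetric contrastive (InfoNCE-style) loss below is an assumption chosen for illustration, not the authors' method. All function names, the temperature value and the toy embeddings are hypothetical; in practice the embeddings would come from the image and audio ResNet encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(image_embs, audio_embs):
    """Entry [i][j] is the cosine similarity of image i and audio clip j."""
    imgs = [l2_normalize(v) for v in image_embs]
    auds = [l2_normalize(v) for v in audio_embs]
    return [[sum(a * b for a, b in zip(img, aud)) for aud in auds] for img in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, audio_embs, temperature=0.1):
    """Assumed objective: co-located pairs (same index) should score highest
    in both the image-to-audio and the audio-to-image direction."""
    sims = cosine_similarity_matrix(image_embs, audio_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy 2-D embeddings (hypothetical values): pair k shares a location.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]
shuffled_audio = [aligned_audio[1], aligned_audio[0]]

loss_aligned = symmetric_contrastive_loss(aligned_images, aligned_audio)
loss_shuffled = symmetric_contrastive_loss(aligned_images, shuffled_audio)
print(loss_aligned < loss_shuffled)
```

In this toy setting the loss is lower when each image embedding lies close to the embedding of its co-located audio clip than when the pairs are shuffled. An objective of this kind would push the two encoders towards a shared representation of the scene.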

updated: Mon Aug 02 2021 07:50:50 GMT+0000 (UTC)

published: Mon Aug 02 2021 07:50:50 GMT+0000 (UTC)

arXiv
