RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data

Yang Zhan; Zhitong Xiong; Yuan Yuan

In this paper, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images under the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been extensively investigated. However, object-level visual grounding on RS images remains under-explored. Thus, in this work, we propose to construct a dataset and explore deep learning models for the RSVG task. Specifically, our contributions can be summarized as follows. 1) We build a new large-scale benchmark dataset for RSVG, termed RSVGD, to advance research on RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. 2) We benchmark a wide range of state-of-the-art (SOTA) natural image visual grounding methods on the constructed RSVGD dataset and provide insightful analyses based on the results. 3) We propose a novel transformer-based Multi-Level Cross-Modal feature learning (MLCM) module. Remotely sensed images usually exhibit large scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MLCM module takes advantage of multi-scale visual features and multi-granularity textual embeddings to learn more discriminative representations. To cope with the cluttered-background problem, MLCM adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multi-level and multi-modal features to boost performance. Furthermore, this work also provides useful insights for developing better RSVG models. The dataset and code will be publicly available at https://github.com/ZhanYang-nwpu/RSVG-pytorch.
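To make the image/expression/box triplet concrete, here is a minimal Python sketch of how one such sample might be represented and how a predicted box could be scored against it. The abstract does not specify a file format or an evaluation metric, so everything here is an assumption for illustration: the `GroundingSample` class, its field names, the `(x_min, y_min, x_max, y_max)` box convention, and the 0.5 intersection-over-union threshold (a common choice in visual grounding work, not a value taken from this paper).

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GroundingSample:
    """Hypothetical container for one image/expression/box triplet."""
    image_path: str   # path to the RS image (hypothetical field name)
    expression: str   # natural-language referring expression
    box: Box          # ground-truth box of the referred object


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, sample: GroundingSample, threshold: float = 0.5) -> bool:
    """Count a prediction as correct if its IoU with the ground truth
    reaches the threshold (0.5 is an assumed, conventional value)."""
    return iou(pred, sample.box) >= threshold


# Illustrative sample; the path and expression are made up.
sample = GroundingSample(
    image_path="images/example.jpg",
    expression="the large storage tank on the left",
    box=(0.0, 0.0, 10.0, 10.0),
)
```

Under these assumptions, a grounding model would take `sample.image_path` and `sample.expression` as input and output a single box, which `is_hit` then compares with `sample.box`.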

updated: Sun Oct 23 2022 07:08:22 GMT+0000 (UTC)

published: Sun Oct 23 2022 07:08:22 GMT+0000 (UTC)

arXiv
