Aerial Scene Understanding in The Wild: Multi-Scene Recognition via Prototype-based Memory Networks

Yuansheng Hua; Lichao Mou; Jianzhe Lin; Konrad Heidler; Xiao Xiang Zhu

Aerial scene recognition is a fundamental visual task that has attracted increasing research interest in the last few years. Most current research focuses on assigning a single scene-level label to an aerial image, whereas in real-world scenarios a single image often contains multiple scenes. In this paper, we therefore take a step toward a more practical and challenging task: multi-scene recognition in single images. Manually producing annotations for such a task is extraordinarily time- and labor-consuming. To address this, we propose a prototype-based memory network that recognizes multiple scenes in a single image by leveraging large numbers of well-annotated single-scene images. The proposed network consists of three key components: 1) a prototype learning module, 2) a prototype-inhabiting external memory, and 3) a multi-head attention-based memory retrieval module. More specifically, we first learn a prototype representation of each aerial scene from single-scene aerial image datasets and store it in an external memory. A multi-head attention-based memory retrieval module then retrieves the scene prototypes relevant to a query multi-scene image for the final prediction. Notably, only a limited number of annotated multi-scene images are needed in the training phase. To facilitate progress in aerial scene recognition, we produce a new multi-scene aerial image (MAI) dataset. Experimental results on various dataset configurations demonstrate the effectiveness of our network. Our dataset and code are publicly available.
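
The three components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every detail below is an assumption made for illustration: prototypes are taken as per-class mean embeddings, the memory is a plain dictionary, each attention head simply sees a slice of the feature vector (no learned projections), and a fixed threshold on the averaged attention weights stands in for the final classifier. The function names, the 4-dimensional features, and the scene labels are all hypothetical.

```python
import math
from collections import defaultdict


def learn_prototypes(features, labels):
    """Assumption: one prototype per scene = mean of its single-scene embeddings."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, memory, num_heads=2):
    """Scaled dot-product attention of one query over all stored prototypes.

    Simplification: each head attends over a contiguous slice of the feature
    vector, and the per-head attention weights are averaged.
    """
    dim = len(query)
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    names = sorted(memory)
    relevance = {name: 0.0 for name in names}
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        scores = [
            sum(q * k for q, k in zip(query[lo:hi], memory[name][lo:hi]))
            / math.sqrt(head_dim)
            for name in names
        ]
        for name, weight in zip(names, softmax(scores)):
            relevance[name] += weight / num_heads
    return relevance


# Hypothetical single-scene embeddings and their scene labels.
single_scene_features = [
    [4.0, 0.0, 4.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 3.0],
    [-3.0, -3.0, -3.0, -3.0],
]
single_scene_labels = ["beach", "beach", "farmland", "harbor"]

# 1) prototype learning, 2) external memory holding the prototypes.
memory = learn_prototypes(single_scene_features, single_scene_labels)

# 3) memory retrieval for a query image that mixes beach and farmland cues.
query = [3.0, 3.0, 3.0, 3.0]
relevance = retrieve(query, memory, num_heads=2)

# Stand-in for the final classifier: keep scenes above an arbitrary threshold.
predicted = [name for name in sorted(relevance) if relevance[name] > 0.25]
print(predicted)
```

In this toy setup the query attends almost equally to the beach and farmland prototypes and barely at all to harbor, so thresholding yields a multi-label prediction. A trained network would learn the embeddings, the attention projections, and the classifier rather than fixing them by hand.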

updated: Thu Apr 22 2021 17:32:14 GMT+0000 (UTC)

published: Thu Apr 22 2021 17:32:14 GMT+0000 (UTC)

arXiv
