Aerial Image Object Detection With Vision Transformer Detector (ViTDet)

Liya Wang; Alex Tien

The past few years have seen increased interest in aerial image object detection because of its critical value to large-scale geoscientific research such as environmental studies, urban planning, and intelligence monitoring. However, the task is very challenging due to the bird's-eye-view perspective, complex backgrounds, large and varied image sizes, diverse object appearances, and the scarcity of well-annotated datasets. Recent advances in computer vision have shown promise in tackling the challenge. Specifically, Vision Transformer Detector (ViTDet) was proposed to extract multi-scale features for object detection. The empirical study shows that ViTDet's simple design achieves good performance on natural scene images and can be easily embedded into any detector architecture. To date, ViTDet's potential benefit to challenging aerial image object detection has not been explored. Therefore, in our study, 25 experiments were carried out to evaluate the effectiveness of ViTDet for aerial image object detection on three well-known datasets: Airbus Aircraft, RarePlanes, and Dataset of Object DeTection in Aerial images (DOTA). Our results show that ViTDet consistently outperforms its convolutional neural network counterparts on horizontal bounding box (HBB) object detection by a large margin (up to 17% in average precision) and that it achieves competitive performance for oriented bounding box (OBB) object detection. Our results also establish a baseline for future research.

updated: Thu Feb 02 2023 18:36:49 GMT+0000 (UTC)

published: Sat Jan 28 2023 02:25:30 GMT+0000 (UTC)

arXiv
