General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

Nhi Kieu; Kien Nguyen; Sridha Sridharan; Clinton Fookes

The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information, and many other sources has provided an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit these complementary data sources, particularly for complex tasks such as semantic segmentation. Specialized architectures have been developed, but they are highly complex, demand significant model-design effort, and require considerable re-engineering whenever a new modality emerges. Recent general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect cars from a top-down view. To address these issues, even under extreme class imbalance, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while the cross-attention mechanism of PerceiverIO reduces the network's computational burden. The effectiveness of the proposed component is validated through extensive experiments comparing it with alternatives such as 2D convolution and a dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed method achieves results competitive with specialized architectures such as UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with only a minimal compromise in performance.
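To make the two ingredients named in the abstract more concrete, the following is a minimal, standard-library-only sketch: a 3D convolution whose depth axis spans stacked modalities, followed by a Perceiver-style cross-attention in which a small latent array queries the resulting tokens. It is an illustration under stated assumptions, not the authors' implementation. The function names (`conv3d_valid`, `cross_attention`), the toy `optical` and `dsm` rasters, the hand-picked kernels, and the omission of learned projections, multiple heads, and the UNet-style encoder/decoder are all hypothetical simplifications.

```python
import math
import random


def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation with no padding.

    volume: nested lists [D][H][W]; here D indexes the stacked modalities.
    kernel: nested lists [kd][kh][kw].
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for d in range(D - kd + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                s = 0.0
                # Each output value mixes a spatial neighbourhood (kh x kw)
                # across kd modalities at the same time.
                for a in range(kd):
                    for b in range(kh):
                        for c in range(kw):
                            s += volume[d + a][i + b][j + c] * kernel[a][b][c]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


def cross_attention(latents, inputs):
    """Single-head cross-attention without learned projections (assumption).

    latents: N query vectors; inputs: M vectors used as both keys and values.
    Returns N vectors, each a softmax-weighted average of the inputs.
    """
    dim = len(inputs[0])
    out = []
    for q in latents:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in inputs]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[t] for w, v in zip(weights, inputs)) / total
                    for t in range(dim)])
    return out


# Hypothetical toy inputs: one optical band and one DSM raster, 4x4 each.
random.seed(0)
optical = [[random.random() for _ in range(4)] for _ in range(4)]
dsm = [[random.random() for _ in range(4)] for _ in range(4)]
volume = [optical, dsm]  # depth axis = modality

# Two hand-picked 2x3x3 kernels standing in for learned filters.
mean_kernel = [[[1 / 18] * 3 for _ in range(3)] for _ in range(2)]
diff_kernel = [[[1 / 9] * 3 for _ in range(3)],
               [[-1 / 9] * 3 for _ in range(3)]]
features = [conv3d_valid(volume, k)[0] for k in (mean_kernel, diff_kernel)]

# One 2-dimensional token per spatial position of the 2x2 feature maps.
tokens = [[features[0][i][j], features[1][i][j]]
          for i in range(2) for j in range(2)]
latents = [[random.gauss(0, 1) for _ in range(2)] for _ in range(2)]
summary = cross_attention(latents, tokens)
print(len(tokens), "input tokens ->", len(summary), "latent vectors")
```

With N latent vectors attending to M input tokens, the score matrix has N × M entries rather than the M × M entries of full self-attention over the inputs, which is the sense in which cross-attention to a small latent array lowers the computational burden.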

updated: Fri Jul 07 2023 04:58:34 GMT+0000 (UTC)

published: Fri Jul 07 2023 04:58:34 GMT+0000 (UTC)

arXiv
