Mutual Generative Transformer Learning for Cross-view Geo-localization

Jianwei Zhao; Qiang Zhai; Rui Huang; Hong Cheng

Cross-view geo-localization (CVGL) aims to estimate the geographical location of a ground-level camera by matching its image against a large database of geo-tagged aerial (e.g., satellite) images. The task remains extremely challenging because of the drastic appearance differences across views. Existing methods mainly employ Siamese-like CNNs to extract global descriptors and do not examine the mutual benefits between the two modalities. In this paper, we present a novel approach for CVGL, mutual generative transformer learning (MGTL), which combines cross-modal knowledge generation with a transformer. Specifically, MGTL develops two separate generative modules, one that generates aerial-like knowledge from ground-level semantic information and one that does the reverse, and fully exploits their mutual benefits through the attention mechanism. Experiments on the challenging public benchmarks CVACT and CVUSA demonstrate the effectiveness of the proposed method compared with existing state-of-the-art models.
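The matching step that the abstract describes can be pictured as nearest-neighbor retrieval over global descriptors. The sketch below is only an illustration of that retrieval formulation, not the MGTL model: the descriptors are hand-written toy vectors rather than network outputs, and the function names (`rank_aerial_candidates`, `localize`), the cosine-similarity scoring, and the geotag values are all assumptions introduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_aerial_candidates(ground_descriptor, aerial_descriptors):
    """Return (index, score) pairs for all aerial images, best match first."""
    scores = [
        (idx, cosine_similarity(ground_descriptor, desc))
        for idx, desc in enumerate(aerial_descriptors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def localize(ground_descriptor, aerial_descriptors, aerial_geotags):
    """Estimate the camera location as the geotag of the top-ranked aerial image."""
    best_idx, _ = rank_aerial_candidates(ground_descriptor, aerial_descriptors)[0]
    return aerial_geotags[best_idx]


# Hypothetical toy database: three aerial descriptors with (lat, lon) geotags.
aerial_descriptors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
aerial_geotags = [(37.77, -122.42), (40.71, -74.01), (35.68, 139.69)]

# Hypothetical descriptor of a ground-level query image.
query = [0.1, 0.9, 0.0]

ranking = rank_aerial_candidates(query, aerial_descriptors)
estimated_location = localize(query, aerial_descriptors, aerial_geotags)
```

In this toy setup the query is closest to the second aerial descriptor, so `localize` returns that image's geotag. In a real system the descriptors would come from the learned ground and aerial branches, and retrieval quality is typically reported as recall at top-k over the ranked list.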

updated: Thu Mar 17 2022 07:29:02 GMT+0000 (UTC)

published: Thu Mar 17 2022 07:29:02 GMT+0000 (UTC)

arXiv
