Optimal Feature Transport for Cross-View Image Geo-Localization

Yujiao Shi; Xin Yu; Liu Liu; Tong Zhang; Hongdong Li

クロスビュー画像の地理的ローカリゼーションのための最適な特徴輸送

このホワイトペーパーでは、大規模な航空地図（高解像度の衛星画像など）と照合することで、地上レベルのストリートビュークエリ画像の地理的位置を推定する、クロスビュー画像の地理的ローカリゼーションの問題に取り組んでいます。最先端のディープラーニングベースの方法は、2つの異なるビューで見られるシーンのグローバルな特徴表現を学習することを目的としたディープメトリック学習としてこの問題に取り組んでいます。このようなディープメトリック学習法によって有望な結果が得られていますが、ローカライズに関連する重要なキュー、つまりローカルフィーチャの空間レイアウトを活用できません。さらに、クロスビューローカリゼーションのコンテキストでは、明らかなドメインギャップ（空中ビューと地上ビューの間）にほとんど注意が払われません。本論文では、地上画像と空中画像間のフィーチャの位置合わせを容易にするクロスビュードメイン転送を明示的に確立するための、新しいクロスビューフィーチャトランスポート（CVFT）手法を提案します。具体的には、CVFTをネットワークレイヤーとして実装します。これにより、あるドメインから別のドメインに機能が転送され、より意味のある機能の類似性の比較が可能になります。モデルは微分可能であり、エンドツーエンドで学習できます。大規模なデータセットの実験により、本手法により、CVUSAデータセットなどの最新のクロスビューローカリゼーションパフォーマンスが著しく向上し、トップ1のリコールが40.79％から61.43％に大幅に改善されたこと、トップ10の場合は76.36％から90.49％です。論文の重要な洞察（つまり、ドメイントランスポートを介したドメインの違いの明示的な処理）は、コンピュータービジョンにおける他の同様の問題にも役立つことが期待されます。

This paper addresses the problem of cross-view image geo-localization, where the geographic location of a ground-level street-view query image is estimated by matching it against a large scale aerial map (e.g., a high-resolution satellite image). State-of-the-art deep-learning based methods tackle this problem as deep metric learning which aims to learn global feature representations of the scene seen by the two different views. Despite promising results are obtained by such deep metric learning methods, they, however, fail to exploit a crucial cue relevant for localization, namely, the spatial layout of local features. Moreover, little attention is paid to the obvious domain gap (between aerial view and ground view) in the context of cross-view localization. This paper proposes a novel Cross-View Feature Transport (CVFT) technique to explicitly establish cross-view domain transfer that facilitates feature alignment between ground and aerial images. Specifically, we implement the CVFT as network layers, which transports features from one domain to the other, leading to more meaningful feature similarity comparison. Our model is differentiable and can be learned end-to-end. Experiments on large-scale datasets have demonstrated that our method has remarkably boosted the state-of-the-art cross-view localization performance, e.g., on the CVUSA dataset, with significant improvements for top-1 recall from 40.79% to 61.43%, and for top-10 from 76.36% to 90.49%. We expect the key insight of the paper (i.e., explicitly handling domain difference via domain transport) will prove to be useful for other similar problems in computer vision as well.

updated: Wed Nov 27 2019 05:53:04 GMT+0000 (UTC)

published: Thu Jul 11 2019 06:56:40 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト