Attention-Aware Age-Agnostic Visual Place Recognition

Ziqi Wang; Jiahui Li; Seyran Khademi; Jan van Gemert

A cross-domain visual place recognition (VPR) task is proposed in this work, i.e., matching images of the same architecture depicted in different domains. VPR is commonly treated as an image retrieval task, where a query image from an unknown location is matched with relevant instances from a geo-tagged gallery database. Unlike conventional VPR settings, where the query and gallery images come from the same domain, we propose a more common but challenging setup in which the query images are collected under a new, unseen condition. The two domains involved in this work are contemporary street view images of Amsterdam from the Mapillary dataset (source domain) and historical images of the same city from the Beeldbank dataset (target domain). We tailored an age-invariant feature learning CNN that can focus on domain-invariant objects and learn to match images based on a weakly supervised ranking loss. We propose an attention aggregation module that is robust to the domain discrepancy between the training and test data. Further, a multi-kernel maximum mean discrepancy (MK-MMD) domain adaptation loss is adopted to improve the cross-domain ranking performance. Both the attention and adaptation modules are unsupervised, while the ranking loss uses weak supervision. Visual inspection shows that the attention module focuses on built forms, while the dramatically changing environment is weighted less. Our proposed CNN achieves state-of-the-art results (99% accuracy) on the single-domain VPR task and at best 20% accuracy on the cross-domain VPR task, revealing the difficulty of age-invariant VPR.
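The abstract names an MK-MMD domain adaptation loss but gives no formula or settings. As a rough illustration of the general idea only, the sketch below computes a biased estimate of the squared MK-MMD between two sets of feature vectors. The Gaussian kernel family, the equal kernel weights, the bandwidth values, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equal-weight average of Gaussian kernels (the weighting is an assumption)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)


def mk_mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of the squared MK-MMD between two feature sets.

    source, target: lists of equal-length feature vectors, e.g. descriptors
    of contemporary (source) and historical (target) images.
    """
    def mean_kernel(a_set, b_set):
        # Average kernel value over all pairs drawn from the two sets.
        total = sum(multi_kernel(a, b, bandwidths) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


if __name__ == "__main__":
    # Toy 2-D features standing in for CNN descriptors (hypothetical values).
    source_feats = [[0.0, 0.0], [1.0, 0.0]]
    target_feats = [[5.0, 5.0], [6.0, 5.0]]
    print(mk_mmd_squared(source_feats, target_feats))
```

The estimate is zero when both sets are identical and grows as the two feature distributions move apart, which is why minimizing such a term alongside a ranking loss can pull source and target features closer together.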

updated: Wed Sep 11 2019 16:04:42 GMT+0000 (UTC)

published: Wed Sep 11 2019 16:04:42 GMT+0000 (UTC)

arXiv
