Learning Multi-Scene Absolute Pose Regression with Transformers

Yoli Shavit; Ron Ferens; Yosi Keller

Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers: encoders aggregate activation maps with self-attention, and decoders transform latent features and scene encodings into candidate pose predictions. This mechanism allows our model to focus on general features that are informative for localization while embedding multiple scenes in parallel. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available at https://github.com/yolish/multi-scene-pose-transformer.
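
The encoder/decoder mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (see the linked repository for that). It assumes single-head attention without learned projections, random stand-in values for the activation tokens and scene queries, and a hypothetical linear head producing a 3-D position plus a unit quaternion. All names (`candidate_poses`, `scene_queries`, `head`) are illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def random_matrix(rows, cols, rng):
    """Gaussian stand-in for learned weights or backbone activations."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def candidate_poses(activation_tokens, scene_queries, head):
    """Return one (position, unit quaternion) candidate per scene query."""
    # Encoder step: every spatial token attends to every other token.
    encoded = attention(activation_tokens, activation_tokens, activation_tokens)
    # Decoder step: each scene query attends to the encoded tokens.
    latents = attention(scene_queries, encoded, encoded)
    poses = []
    for row in matmul(latents, head):  # hypothetical linear regression head
        position, quat = row[:3], row[3:]
        norm = math.sqrt(sum(q * q for q in quat))
        poses.append((position, [q / norm for q in quat]))
    return poses


rng = random.Random(0)
DIM, NUM_TOKENS, NUM_SCENES = 8, 16, 4
tokens = random_matrix(NUM_TOKENS, DIM, rng)    # flattened activation map
queries = random_matrix(NUM_SCENES, DIM, rng)   # one encoding per scene
head = random_matrix(DIM, 7, rng)               # 3 position + 4 quaternion values
poses = candidate_poses(tokens, queries, head)
position, orientation = poses[2]  # assumption: the scene index is already known
```

Each scene query yields its own candidate pose from the same encoded image tokens, which is what lets several scenes share one model. How the candidate for the correct scene is chosen is not covered by the abstract, so the sketch simply indexes a known scene.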

updated: Mon Jul 26 2021 10:11:11 GMT+0000 (UTC)

published: Sun Mar 21 2021 19:21:44 GMT+0000 (UTC)

arXiv
