Do We Really Need Scene-specific Pose Encoders?

Yoli Shavit; Ron Ferens

Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn a pose encoding from an image using deep convolutional networks that are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. To test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and, in some cases, even surpass current state-of-the-art pose regressors. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor to date that consistently localizes to within 2 meters and 5 degrees.
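To make the idea of a branching fully connected regressor over pre-computed encodings more concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' implementation. The class name `BranchingPoseRegressor`, the layer sizes, the single shared layer, the split into a position branch and an orientation branch, and the unit-quaternion orientation output are all assumptions made for illustration. The abstract specifies none of them, and training is omitted.

```python
import math
import random


def make_layer(n_in, n_out, rng, bias=None):
    """Build a fully connected layer as (weights, biases) with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = list(bias) if bias is not None else [0.0] * n_out
    return weights, biases


def linear(layer, x):
    """Apply y = W x + b to a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


class BranchingPoseRegressor:
    """Hypothetical shallow regressor: one shared layer, then two separate branches."""

    def __init__(self, enc_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Shared fully connected layer applied to the pre-computed image encoding.
        self.shared = make_layer(enc_dim, hidden_dim, rng)
        # Position branch: hidden layer, then a 3-vector (x, y, z). Assumed layout.
        self.pos_hidden = make_layer(hidden_dim, hidden_dim, rng)
        self.pos_out = make_layer(hidden_dim, 3, rng)
        # Orientation branch: hidden layer, then a quaternion (assumed representation).
        # The output bias starts at the identity rotation (1, 0, 0, 0).
        self.ori_hidden = make_layer(hidden_dim, hidden_dim, rng)
        self.ori_out = make_layer(hidden_dim, 4, rng, bias=[1.0, 0.0, 0.0, 0.0])

    def forward(self, encoding):
        """Map one encoding vector to (position, unit quaternion)."""
        shared = relu(linear(self.shared, encoding))
        position = linear(self.pos_out, relu(linear(self.pos_hidden, shared)))
        quat = linear(self.ori_out, relu(linear(self.ori_hidden, shared)))
        # Normalize so the orientation output is a valid unit quaternion.
        norm = math.sqrt(sum(q * q for q in quat)) or 1.0
        orientation = [q / norm for q in quat]
        return position, orientation


# Usage with a made-up 8-dimensional encoding; an encoding from a real
# image retrieval model would be much larger.
model = BranchingPoseRegressor(enc_dim=8, hidden_dim=16, seed=0)
position, orientation = model.forward([0.1 * i for i in range(8)])
```

In this sketch the image encoder never appears: the model only consumes a fixed vector per image. That reflects the hypothesis above, that encodings trained for visual similarity can replace a pose encoder trained per scene.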

updated: Tue Dec 22 2020 13:59:52 GMT+0000 (UTC)

published: Tue Dec 22 2020 13:59:52 GMT+0000 (UTC)

arXiv
