Deep Permutation Equivariant Structure from Motion

Dror Moran; Hodaya Koslowsky; Yoni Kasten; Haggai Maron; Meirav Galun; Ronen Basri

Existing deep methods produce highly accurate 3D reconstructions in stereo and multiview stereo settings, i.e., when cameras are both internally and externally calibrated. Nevertheless, the simultaneous recovery of camera poses and 3D scene structure in multiview settings with deep networks remains an open challenge. Inspired by projective factorization for Structure from Motion (SFM) and by deep matrix completion techniques, we propose a neural network architecture that, given a set of point tracks in multiple images of a static scene, recovers both the camera parameters and a (sparse) scene structure by minimizing an unsupervised reprojection loss. Our network architecture is designed to respect the structure of the problem: the sought output is equivariant to permutations of both cameras and scene points. Notably, our method does not require initialization of camera parameters or 3D point locations. We test our architecture in two setups: (1) single scene reconstruction and (2) learning from multiple scenes. Our experiments, conducted on a variety of datasets in both internally calibrated and uncalibrated settings, indicate that our method accurately recovers pose and structure, on par with classical state-of-the-art methods. Additionally, we show that a pre-trained network can be used to reconstruct novel scenes using inexpensive fine-tuning with no loss of accuracy.
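To make the equivariance property concrete, the following is a minimal NumPy sketch of one linear layer that is equivariant to permutations of both cameras (rows) and scene points (columns), in the spirit of the deep matrix completion layers the abstract mentions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight layout, and the assumption that every point is visible in every camera (a dense track tensor) are all hypothetical simplifications.

```python
import numpy as np


def equivariant_layer(x, w_self, w_row, w_col, w_all, bias):
    """Hypothetical row/column permutation-equivariant linear layer.

    x has shape (m, n, d_in): entry (i, j) holds the features of point j
    as observed in camera i. Assumes a dense tensor (no missing tracks).
    Each weight has shape (d_in, d_out); bias has shape (d_out,).
    """
    row_mean = x.mean(axis=1, keepdims=True)       # (m, 1, d_in): per-camera summary
    col_mean = x.mean(axis=0, keepdims=True)       # (1, n, d_in): per-point summary
    all_mean = x.mean(axis=(0, 1), keepdims=True)  # (1, 1, d_in): global summary
    # Every term either acts entrywise or is a pooled summary broadcast back,
    # so reordering cameras or points reorders the output the same way.
    return (x @ w_self + row_mean @ w_row + col_mean @ w_col
            + all_mean @ w_all + bias)


def check_equivariance(seed=0):
    """Return True if permuting inputs permutes the outputs identically."""
    rng = np.random.default_rng(seed)
    m, n, d_in, d_out = 4, 6, 2, 3  # cameras, points, feature sizes (arbitrary)
    x = rng.normal(size=(m, n, d_in))
    weights = [rng.normal(size=(d_in, d_out)) for _ in range(4)]
    bias = rng.normal(size=d_out)
    p = rng.permutation(m)  # reorder cameras
    q = rng.permutation(n)  # reorder points
    out = equivariant_layer(x, *weights, bias)
    out_perm = equivariant_layer(x[p][:, q], *weights, bias)
    return bool(np.allclose(out_perm, out[p][:, q]))
```

Calling `check_equivariance()` compares the layer's output on a permuted input with the permuted output on the original input. Handling tracks with missing observations would require pooling over observed entries only, which this sketch leaves out.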

updated: Tue Oct 19 2021 11:17:21 GMT+0000 (UTC)

published: Wed Apr 14 2021 08:50:06 GMT+0000 (UTC)

arXiv
