LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals

Arjun Karpur; Guilherme Perrotta; Ricardo Martin-Brualla; Howard Zhou; Andre Araujo

Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively make use of the low-dimensional 3D information. We experiment with two different 3D signals (normalized object coordinates and monocular depth estimates) and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. Additionally, we demonstrate that the resulting improved correspondences lead to much higher relative pose accuracy for in-the-wild image pairs, an improvement of up to 8.6% over the 2D-only approach.
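The abstract does not spell out which positional encoding LFM-3D uses, so the following is only a minimal sketch under stated assumptions. It applies a NeRF-style sinusoidal (Fourier-feature) encoding to per-keypoint 3D signals such as normalized object coordinates. It then adds a linear projection of the result to 2D local descriptors, which would happen before they enter a graph-neural-network matcher. The function names `fourier_encode` and `fuse_3d_into_descriptors`, the number of frequencies, the 256-D descriptor size, and the random projection matrix (standing in for a learned layer) are all hypothetical illustrations, not the paper's actual design.

```python
import numpy as np


def fourier_encode(points: np.ndarray, num_freqs: int = 6) -> np.ndarray:
    """Sinusoidal encoding of low-dimensional 3D signals (assumed NeRF-style).

    points: (N, 3) array, e.g. per-keypoint normalized object coordinates.
    Returns an (N, 3 + 6 * num_freqs) array: the raw coordinates followed by
    sin and cos of each coordinate at octave-spaced frequencies.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi                      # (F,)
    scaled = points[:, :, None] * freqs[None, None, :]                 # (N, 3, F)
    waves = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, 3, 2F)
    return np.concatenate([points, waves.reshape(len(points), -1)], axis=-1)


def fuse_3d_into_descriptors(descriptors: np.ndarray, points: np.ndarray,
                             proj: np.ndarray, num_freqs: int = 6) -> np.ndarray:
    """Add a projected 3D encoding to 2D local descriptors.

    proj: (3 + 6 * num_freqs, D) matrix standing in for a learned layer
    (hypothetical; a trained matcher would learn it end to end).
    """
    enc = fourier_encode(points, num_freqs)
    if enc.shape[1] != proj.shape[0] or descriptors.shape[1] != proj.shape[1]:
        raise ValueError("projection shape does not match encoding/descriptor dims")
    return descriptors + enc @ proj


# Example: 5 keypoints with NOCS-like coordinates in [0, 1]^3 and 256-D descriptors.
rng = np.random.default_rng(0)
kpts_3d = rng.uniform(0.0, 1.0, size=(5, 3))
desc = rng.normal(size=(5, 256))
proj = rng.normal(scale=0.1, size=(3 + 6 * 6, 256))

enc = fourier_encode(kpts_3d)                          # shape (5, 39)
fused = fuse_3d_into_descriptors(desc, kpts_3d, proj)  # shape (5, 256)
```

Lifting the 3 raw coordinates onto several frequencies gives the matcher features that change quickly with small differences in 3D position. A single linear map of the raw coordinates cannot provide this. That is one plausible reading of why the abstract calls the choice of encoding critical for low-dimensional 3D signals.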

updated: Fri Aug 18 2023 20:25:57 GMT+0000 (UTC)

published: Wed Mar 22 2023 17:46:27 GMT+0000 (UTC)

arXiv
