MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang; Jiaming Zhang; Kailun Yang; Kunyu Peng; Rainer Stiefelhagen

Local feature matching is a computationally intensive task at the subpixel level. Detector-based methods coupled with feature descriptors struggle in low-texture scenes, while CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder with matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer uses only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
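The interleaving idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: it assumes identity query/key/value projections, a single head, ordinary softmax attention, and no positional embedding or downsampling between stages. The names `attend`, `interleaved_stage`, and the `pattern` string (e.g. `"sc"` for one self-attention step followed by one cross-attention step) are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with a residual connection.

    Assumption: queries, keys, and values are the raw tokens themselves
    (identity projections), which a trained model would not use.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        mixed = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        # Residual: keep the original token and add the attended signal.
        out.append([a + b for a, b in zip(q, mixed)])
    return out


def interleaved_stage(feat_a, feat_b, pattern="sc"):
    """Apply self ('s') and cross ('c') attention in the given order.

    feat_a and feat_b are token lists (one list of floats per token)
    from the two images being matched.
    """
    for kind in pattern:
        if kind == "s":
            # Self-attention: each image attends only to itself (extraction).
            feat_a, feat_b = attend(feat_a, feat_a), attend(feat_b, feat_b)
        elif kind == "c":
            # Cross-attention: each image attends to the other (matching).
            # Both right-hand sides are evaluated before assignment,
            # so the update is symmetric.
            feat_a, feat_b = attend(feat_a, feat_b), attend(feat_b, feat_a)
        else:
            raise ValueError(f"unknown attention kind: {kind!r}")
    return feat_a, feat_b


if __name__ == "__main__":
    image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 channels
    image_b = [[0.5, 0.5], [2.0, -1.0]]             # 2 tokens, 2 channels
    out_a, out_b = interleaved_stage(image_a, image_b, pattern="sc")
    print(len(out_a), len(out_b))
```

In this sketch, a self-attention step leaves each image's features independent of the other image, while a cross-attention step makes them depend on it. That is the sense in which an encoder built from such stages becomes match-aware before any decoder runs.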

updated: Fri Sep 23 2022 20:38:11 GMT+0000 (UTC)

published: Thu Mar 17 2022 22:49:14 GMT+0000 (UTC)

arXiv
