MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang; Jiaming Zhang; Kailun Yang; Kunyu Peng; Rainer Stiefelhagen

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, enabling a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art performance on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc). Code will be made publicly available at https://github.com/jamycheung/MatchFormer.
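To make the interleaving idea concrete, the following is a minimal sketch in plain Python and is not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, a residual connection after each layer, and a hypothetical `pattern` string in which `"s"` means self-attention and `"c"` means cross-attention. The names `attention`, `interleaved_stage`, and `pattern` are illustrative only and do not come from the MatchFormer code base.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    Each query token is replaced by a weighted average of the context tokens.
    (Assumption: a real model would add linear projections and multiple heads.)
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(dim)])
    return out


def add(x, y):
    """Element-wise sum of two token lists (the residual connection)."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def interleaved_stage(feat_a, feat_b, pattern="sc"):
    """Apply self ("s") and cross ("c") attention layers in the given order.

    feat_a and feat_b are token lists for the two images. In a self layer
    each image attends to itself; in a cross layer it attends to the other.
    """
    for kind in pattern:
        if kind == "s":
            ctx_a, ctx_b = feat_a, feat_b
        elif kind == "c":
            ctx_a, ctx_b = feat_b, feat_a
        else:
            raise ValueError(f"unknown layer kind: {kind!r}")
        # Both updates read the features from before this layer.
        feat_a, feat_b = (add(feat_a, attention(feat_a, ctx_a)),
                          add(feat_b, attention(feat_b, ctx_b)))
    return feat_a, feat_b


random.seed(0)
feat_a = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]  # 6 tokens
feat_b = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # 5 tokens
out_a, out_b = interleaved_stage(feat_a, feat_b, pattern="sc")
print(len(out_a), len(out_a[0]), len(out_b), len(out_b[0]))
```

With a pattern containing only `"s"`, the two images never exchange information, which corresponds to the sequential extract-to-match setting the abstract contrasts against. Adding `"c"` layers inside the stage is what lets the encoder itself become match-aware.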

updated: Mon Mar 21 2022 18:36:30 GMT+0000 (UTC)

published: Thu Mar 17 2022 22:49:14 GMT+0000 (UTC)

arXiv
