Paying Attention to Multiscale Feature Maps in Multimodal Image Matching

Aviad Moreshet; Yosi Keller

We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single-modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
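The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the CNN is replaced by random stand-in feature maps, the encoder is reduced to a single self-attention head (no positional encoding, layer normalization, or feed-forward block), tokens are mean-pooled, and the residual bypass is combined with the encoder output by summation. All function and variable names (`self_attention`, `embed_patch`, `weights`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (N, C) tokens.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def embed_patch(multiscale_maps, weights):
    # multiscale_maps: list of (C, H_s, W_s) arrays, one per scale.
    # Flatten every spatial location of every scale into one token sequence.
    tokens = np.concatenate(
        [m.reshape(m.shape[0], -1).T for m in multiscale_maps], axis=0
    )  # shape (N, C), N = sum of H_s * W_s over scales
    encoded = self_attention(tokens, *weights).mean(axis=0)
    bypass = tokens.mean(axis=0)  # residual path that skips the encoder
    # Assumption: the two paths are merged by simple addition.
    return encoded + bypass


rng = np.random.default_rng(0)
C = 8  # channel count shared by all scales (assumed)
# One set of weights shared by both branches, as in a Siamese setup.
weights = tuple(rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

# Random stand-ins for the CNN feature maps of two patches at two scales.
maps_a = [rng.standard_normal((C, 4, 4)), rng.standard_normal((C, 2, 2))]
maps_b = [rng.standard_normal((C, 4, 4)), rng.standard_normal((C, 2, 2))]

emb_a = embed_patch(maps_a, weights)
emb_b = embed_patch(maps_b, weights)
distance = np.linalg.norm(emb_a - emb_b)  # smaller means a closer match
```

Because the bypass term does not depend on the attention weights, it still carries a signal from the CNN features when the encoder output is uninformative, which is one plausible reading of why the residual connection eases training from scratch.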

updated: Sat Mar 20 2021 21:14:24 GMT+0000 (UTC)

published: Sat Mar 20 2021 21:14:24 GMT+0000 (UTC)

arXiv
