Vision transformers (ViTs) have recently been applied to visual matching, beyond their established use in object detection and segmentation. However, the standard grid-splitting strategy of ViTs ignores the spatial positions of keypoints, which limits sensitivity to local information. We therefore propose QueryTrans (Query Transformer), which adopts a cross-attention module and a keypoint-based center-crop strategy to better extract spatial information. We further integrate a graph attention module and devise a transformer-based graph matching (GM) approach, GMTR (Graph Matching TRansformers), in which the combinatorial nature of GM is addressed by a graph-transformer neural GM solver. On standard GM benchmarks, GMTR is competitive with state-of-the-art (SOTA) frameworks. Specifically, on Pascal VOC, GMTR achieves 83.6% accuracy, 0.9% higher than the SOTA framework. On Spair-71k, GMTR outperforms most previous works. Meanwhile, on Pascal VOC, QueryTrans improves the accuracy of NGMv2 from 80.1% to 83.3% and that of BBGM from 79.0% to 84.5%. On Spair-71k, QueryTrans improves NGMv2 from 80.6% to 82.5% and BBGM from 82.1% to 83.9%. Source code will be made publicly available.
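The contrast between ViT grid splitting and keypoint-based center cropping, followed by cross-attention, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`grid_patches`, `keypoint_center_crops`, `cross_attention`), the zero padding at image borders, the use of raw flattened pixels as tokens, the single-head attention with random untrained projections, and the choice of keypoint crops as queries with grid tokens as keys and values are all assumptions made for illustration.

```python
import numpy as np


def grid_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """ViT-style tokens: split the image into non-overlapping patch x patch cells.

    Returns an array of shape (num_cells, patch * patch * C); keypoint
    positions play no role in where the cells fall.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    cropped = image[: gh * patch, : gw * patch]
    cells = cropped.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return cells.reshape(gh * gw, patch * patch * c)


def keypoint_center_crops(image: np.ndarray, keypoints: np.ndarray, patch: int) -> np.ndarray:
    """Hypothetical keypoint-based center crop: one patch x patch window per keypoint.

    keypoints: (N, 2) array of (x, y) pixel coordinates. Pixels outside the
    image are zero-padded. For odd `patch` the keypoint is the exact center.
    Returns an array of shape (N, patch * patch * C).
    """
    h, w, _ = image.shape
    half = patch // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    # Round to integer pixels and clip so every window fits inside `padded`.
    pts = np.round(keypoints).astype(int)
    pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)
    pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)
    crops = []
    for x, y in pts:
        # Original pixel (y, x) sits at (y + half, x + half) in `padded`, so the
        # window starting at (y, x) in padded coordinates is centered on it.
        crops.append(padded[y : y + patch, x : x + patch].reshape(-1))
    return np.stack(crops)


def cross_attention(queries, keys_values, d_model, rng):
    """Single-head scaled dot-product cross-attention with random projections."""
    d_q, d_kv = queries.shape[1], keys_values.shape[1]
    w_q = rng.standard_normal((d_q, d_model)) / np.sqrt(d_q)
    w_k = rng.standard_normal((d_kv, d_model)) / np.sqrt(d_kv)
    w_v = rng.standard_normal((d_kv, d_model)) / np.sqrt(d_kv)
    q, k, v = queries @ w_q, keys_values @ w_k, keys_values @ w_v
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over grid tokens
    return attn @ v, attn


# Usage: keypoint-centered tokens (queries) attend to grid tokens (keys/values).
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
keypoints = np.array([[10.0, 12.0], [40.5, 33.2], [63.0, 0.0]])
grid = grid_patches(image, patch=16)
kp = keypoint_center_crops(image, keypoints, patch=15)
out, attn = cross_attention(kp, grid, d_model=32, rng=rng)
print(grid.shape, kp.shape, out.shape)  # (16, 768) (3, 675) (3, 32)
```

The design point the sketch tries to convey is that each query token is anchored on a keypoint, so its receptive field follows the keypoint rather than a fixed grid cell; how QueryTrans actually embeds and combines these tokens is described in the paper, not here.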
updated: Tue Nov 14 2023 13:12:47 GMT+0000 (UTC)
published: Tue Nov 14 2023 13:12:47 GMT+0000 (UTC)