Vision transformers (ViTs) have recently been applied to visual matching, beyond object detection and segmentation. However, the original grid-dividing strategy of ViTs neglects the spatial information of keypoints, which limits sensitivity to local information. We therefore propose \textbf{QueryTrans} (Query Transformer), which adopts a cross-attention module and a keypoint-based center-crop strategy for better spatial information extraction. We further integrate a graph attention module and devise a transformer-based graph matching (GM) approach, \textbf{GMTR} (Graph Matching TRansformers), in which the combinatorial nature of GM is addressed by a graph-transformer neural GM solver. On standard GM benchmarks, GMTR is competitive with state-of-the-art (SOTA) frameworks. Specifically, on Pascal VOC, GMTR achieves $\mathbf{83.6\%}$ accuracy, $\mathbf{0.9\%}$ higher than the SOTA framework. On SPair-71k, GMTR outperforms most previous works. Meanwhile, on Pascal VOC, QueryTrans improves the accuracy of NGMv2 from $80.1\%$ to $\mathbf{83.3\%}$, and that of BBGM from $79.0\%$ to $\mathbf{84.5\%}$. On SPair-71k, QueryTrans improves NGMv2 from $80.6\%$ to $\mathbf{82.5\%}$, and BBGM from $82.1\%$ to $\mathbf{83.9\%}$. Source code will be made publicly available.
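To make the keypoint-based center-crop idea concrete, the following is a minimal sketch of cropping one fixed-size patch around each keypoint, in contrast to the uniform grid division of a standard ViT. It is an illustration under stated assumptions, not the QueryTrans implementation: the function name `crop_around_keypoints`, the `patch_size` parameter, the rounding of coordinates to integer pixels, and the zero-padding at image borders are all hypothetical choices made here.

```python
import numpy as np


def crop_around_keypoints(image, keypoints, patch_size):
    """Return one (patch_size x patch_size) patch centered on each keypoint.

    image:     array of shape (H, W, C)
    keypoints: array-like of shape (K, 2) holding (x, y) pixel coordinates
    Assumption (not from the paper): regions outside the image are zero-padded.
    """
    half = patch_size // 2
    # Pad height and width so that crops near the border stay in bounds.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    patches = []
    for x, y in np.round(np.asarray(keypoints)).astype(int):
        # In padded coordinates the keypoint sits at (y + half, x + half),
        # so a window centered on it starts at row y, column x.
        patches.append(padded[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)  # shape (K, patch_size, patch_size, C)


# Toy usage: a 5x5 single-channel image with pixel values 1..25.
image = np.arange(1, 26).reshape(5, 5, 1)
keypoints = [[2, 2], [0, 0]]  # (x, y) pairs: image center and top-left corner
patches = crop_around_keypoints(image, keypoints, patch_size=3)
```

In a pipeline of the kind the abstract describes, each such patch could then be embedded and used as a query in a cross-attention layer over the image tokens; that step is omitted here because its details are not given in the abstract.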