Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry, where it is used to estimate the poses of different parts and thereby facilitates downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated from coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network, which uses the reference and aligned images to obtain sub-pixel level correspondences; these are used to compute the final geometric transformation. Extensive evaluation shows that our method significantly outperforms state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.
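To make the role of correspondences concrete, the following is a minimal sketch of how a homography can be recovered from a set of point correspondences, first from coarse (grid-quantized) matches and then from sub-pixel ones. It uses the classical normalized Direct Linear Transform, not the learned, differentiable pipeline described above. The function names `warp_points` and `estimate_homography`, the 8-pixel grid, and the example matrix are assumptions chosen purely for illustration.

```python
import numpy as np


def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of 2D points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def _normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    return warp_points(T, pts), T


def estimate_homography(src, dst):
    """Least-squares homography mapping src -> dst (normalized DLT)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 2:
        raise ValueError("src and dst must both have shape (N, 2)")
    if len(src) < 4:
        raise ValueError("at least 4 correspondences are required")

    src_n, T_src = _normalize(src)
    dst_n, T_dst = _normalize(dst)

    # Each correspondence contributes two linear constraints on the
    # nine entries of H.
    rows = []
    for (x, y), (u, v) in zip(src_n, dst_n):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)

    # The solution is the right singular vector belonging to the
    # smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    H_n = vt[-1].reshape(3, 3)

    # Undo the normalization and fix the scale (assumes H[2, 2] != 0).
    H = np.linalg.inv(T_dst) @ H_n @ T_src
    return H / H[2, 2]


if __name__ == "__main__":
    # Hypothetical ground-truth warp and template points.
    H_true = np.array([[1.1, 0.05, 12.0],
                       [-0.03, 0.95, -7.0],
                       [1e-4, -2e-4, 1.0]])
    src = np.array([[0, 0], [200, 0], [200, 150],
                    [0, 150], [100, 75], [50, 120]], dtype=float)
    dst = warp_points(H_true, src)

    # Coarse stage: matches are only known up to an 8-pixel grid.
    coarse_dst = np.round(dst / 8.0) * 8.0
    H_coarse = estimate_homography(src, coarse_dst)

    # Fine stage: sub-pixel matches give the final transformation.
    H_fine = estimate_homography(src, dst)

    for name, H in [("coarse", H_coarse), ("fine", H_fine)]:
        err = np.linalg.norm(warp_points(H, src) - dst, axis=1).mean()
        print(f"{name} mean reprojection error: {err:.6f} px")
```

In this toy setting the coarse estimate plays the part of the initial warp, and re-estimating from sub-pixel matches plays the part of the refinement step. In practice the correspondences would come from the matching network and would usually be filtered for outliers before fitting.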