Image matching, which seeks robust and accurate correspondences across images, remains a challenging task under extreme conditions. Capturing local and global features simultaneously is an important way to mitigate this difficulty, but recent transformer-based decoders are still limited by two issues: CNN-based encoders extract only local features, and transformers lack locality. Inspired by the locality and implicit positional encoding of convolutions, a novel convolutional transformer is proposed to capture both local contexts and global structures more thoroughly for detector-free matching. First, a universal FPN-like framework captures global structures in both the self-encoder and the cross-decoder with transformers, and compensates for local contexts and implicit positional encoding with convolutions. Second, a novel convolutional transformer module explores multi-scale long-range dependencies through a novel multi-scale attention and further aggregates local information within those dependencies to enhance locality. Finally, a novel regression-based sub-pixel refinement module exploits the features of the whole fine-grained window to regress fine-level positional deviations. The proposed method achieves superior performance on a wide range of benchmarks. The code will be available at https://github.com/zwh0527/LGFCTR.
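To make the idea of regression-based sub-pixel refinement concrete, the following is a minimal sketch and not the authors' implementation. It assumes the fine-level head is a single linear layer over the flattened window features, with a `tanh` that bounds the predicted deviation to the window radius; the function names `regress_offset` and `refine_match`, the window size, and the randomly initialized weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def regress_offset(window, weights, bias, radius):
    """Regress a bounded (dx, dy) deviation from ALL features in a fine window.

    window  : list of w*w cells, each a list of c feature values
    weights : 2 rows of length w*w*c (hypothetical linear head, one row per axis)
    bias    : 2 values, one per axis
    radius  : maximum absolute deviation in pixels (assumed to be the window radius)
    """
    # Flatten the whole window so every cell contributes to the regression.
    flat = [value for cell in window for value in cell]
    offsets = []
    for row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(row, flat)) + b
        # tanh keeps the regressed deviation inside [-radius, radius].
        offsets.append(radius * math.tanh(z))
    return offsets[0], offsets[1]


def refine_match(coarse_xy, window, weights, bias, radius):
    """Add the regressed deviation to a coarse match location."""
    dx, dy = regress_offset(window, weights, bias, radius)
    return coarse_xy[0] + dx, coarse_xy[1] + dy


if __name__ == "__main__":
    random.seed(0)
    w, c = 5, 4  # assumed window side length and feature channels
    window = [[random.uniform(-1, 1) for _ in range(c)] for _ in range(w * w)]
    weights = [[random.uniform(-0.1, 0.1) for _ in range(w * w * c)] for _ in range(2)]
    bias = [0.0, 0.0]
    print(refine_match((120.0, 64.0), window, weights, bias, radius=w // 2))
```

In this sketch the deviation is predicted from the entire window rather than from a single peak location, which is the property the abstract attributes to the refinement module; in practice the head would be trained rather than randomly initialized.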