Residual Learning for Image Point Descriptors

Local image feature descriptors have had a tremendous impact on the development and application of computer vision methods. It is therefore unsurprising that significant effort is being devoted to learning-based image point descriptors. However, the advantage of learned methods over handcrafted methods in real applications is subtler and more nuanced than expected. Moreover, handcrafted descriptors such as SIFT and SURF still provide better point localization in Structure-from-Motion (SfM) than many learned counterparts. In this paper, we propose a very simple and effective approach to learning local image descriptors that builds on a handcrafted detector and descriptor. Specifically, we learn only the descriptors, supported by handcrafted descriptors, and discard the point localization head. We optimize the final descriptor by leveraging the knowledge already present in the handcrafted descriptor. This form of optimization lets us avoid relearning knowledge that is already captured by non-differentiable functions such as the handcrafted descriptors, so that the main network branch learns only the residual knowledge. The result is 50x faster convergence than the standard SuperPoint baseline architecture, while at inference the combined descriptor outperforms both the learned and the handcrafted descriptors. This comes with only a minor increase in computation over the baseline learned descriptor. Our approach has potential applications in ensemble learning and in learning with non-differentiable functions. We perform experiments in matching, camera localization, and Structure-from-Motion to showcase the advantages of our approach.
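
To make the idea of a combined descriptor concrete, the following is a minimal sketch in Python with NumPy. The abstract does not specify how the handcrafted and learned descriptors are fused, so the concatenation of L2-normalized descriptors used here is an assumption, as are the function names `combine_descriptors` and `mutual_nearest_neighbors`, the `weight` parameter, and the descriptor dimensions in the usage note. The handcrafted descriptor is treated as a fixed input, and only the learned branch would receive gradients during training.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def combine_descriptors(handcrafted, learned, weight=1.0):
    """Fuse a fixed handcrafted descriptor with a learned residual descriptor.

    Assumption: fusion is a weighted concatenation of the two normalized
    descriptors; the source abstract does not state the fusion rule.
    handcrafted: (N, Dh) array, e.g. SIFT descriptors at N keypoints.
    learned:     (N, Dl) array produced by the network at the same keypoints.
    """
    h = l2_normalize(handcrafted)
    r = l2_normalize(learned)
    return l2_normalize(np.concatenate([h, weight * r], axis=-1))


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (M, 2) index pairs that are mutual nearest neighbors
    under dot-product similarity."""
    sim = desc_a @ desc_b.T
    nn_ab = sim.argmax(axis=1)  # best match in B for each row of A
    nn_ba = sim.argmax(axis=0)  # best match in A for each row of B
    idx_a = np.arange(desc_a.shape[0])
    keep = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
```

Under this assumed fusion, the similarity between two combined descriptors is (up to the small `eps` term) `(h·h' + weight² r·r') / (1 + weight²)`, so the learned branch only has to adjust the similarity already provided by the handcrafted part. A typical call would pass, for example, 128-dimensional handcrafted descriptors and 256-dimensional learned descriptors for the same keypoints in each image, then match the two images with `mutual_nearest_neighbors`.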
