We propose FocusTune, a focus-guided sampling technique to improve the performance of visual localization algorithms. FocusTune directs a scene coordinate regression model towards regions critical for 3D point triangulation by exploiting key geometric constraints. Specifically, rather than uniformly sampling points across the image to train the scene coordinate regression model, we re-project 3D scene coordinates onto the 2D image plane and sample within a local neighborhood of the re-projected points. While our proposed sampling strategy is generally applicable, we showcase FocusTune by integrating it with the recently introduced Accelerated Coordinate Encoding (ACE) model. Our results demonstrate that FocusTune matches or improves on state-of-the-art performance whilst keeping ACE's appealing low storage and compute requirements. For example, on the Cambridge Landmarks dataset it reduces translation error from 25 cm to 19 cm for single models and from 17 cm to 15 cm for ensemble models. This combination of high performance and low compute and storage requirements is particularly promising for applications in areas like mobile robotics and augmented reality. Our code is available at \url{https://github.com/sontung/focus-tune}.
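The sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a simple pinhole camera with intrinsics `K` and a world-to-camera pose `(R, t)`, and the function names, the square neighborhood, the `radius` parameter, and the uniform fallback are all hypothetical choices made for illustration.

```python
import numpy as np


def project_points(points_world, K, R, t):
    """Project Nx3 world points to pixel coordinates (pinhole model, assumed)."""
    cam = points_world @ R.T + t            # world frame -> camera frame
    cam = cam[cam[:, 2] > 1e-6]             # drop points behind the camera
    uvw = cam @ K.T                         # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]         # perspective division


def focus_guided_sample(points_world, K, R, t, image_size, num_samples,
                        radius, rng):
    """Sample integer pixel locations near re-projected 3D points.

    Hypothetical helper: picks a re-projected point at random, then a pixel
    inside a (2 * radius + 1)-wide square window around it.
    """
    width, height = image_size
    uv = project_points(points_world, K, R, t)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
              (uv[:, 1] >= 0) & (uv[:, 1] < height))
    uv = uv[inside]

    if len(uv) == 0:
        # Assumed fallback: uniform sampling when no 3D point is visible.
        xs = rng.integers(0, width, num_samples)
        ys = rng.integers(0, height, num_samples)
        return np.stack([xs, ys], axis=1)

    centers = uv[rng.integers(0, len(uv), num_samples)]
    offsets = rng.integers(-radius, radius + 1, size=(num_samples, 2))
    pixels = np.rint(centers).astype(np.int64) + offsets
    pixels[:, 0] = np.clip(pixels[:, 0], 0, width - 1)   # stay inside image
    pixels[:, 1] = np.clip(pixels[:, 1], 0, height - 1)
    return pixels


# Example usage with made-up camera parameters and 3D points.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0],
                   [1.0, 0.5, 5.0],
                   [-1.0, -0.5, 4.0],
                   [0.0, 0.0, -2.0]])       # last point lies behind the camera
rng = np.random.default_rng(0)
samples = focus_guided_sample(points, K, np.eye(3), np.zeros(3),
                              (640, 480), 200, 4, rng)
```

Compared with uniform sampling, every returned pixel lies within a few pixels of a location where a 3D scene point is observed, which is the behavior the abstract attributes to focus-guided sampling. How the neighborhood is shaped and sized in practice is a design choice this sketch does not try to reproduce.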