Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
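To make the two compression steps concrete, the following is a minimal sketch, not the TeTRA implementation. It assumes a simple mean-absolute-value scaling rule for ternarization and sign-based binarization of the embedding. Neither rule is specified in the abstract, and the function names `ternarize`, `binarize`, `hamming`, and `nearest_reference` are hypothetical. Each ternary weight takes one of three values and therefore fits in 2 bits, while binary descriptors can be compared by Hamming distance rather than floating-point similarity.

```python
# Illustrative sketch only: the scaling rule and all names are assumptions,
# not details taken from the TeTRA paper.

def ternarize(weights):
    """Map float weights to codes in {-1, 0, +1} plus one shared scale.

    Assumed rule: divide by the mean absolute weight, round, then clamp.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def binarize(embedding):
    """Keep only the sign of each dimension, packed into one Python int."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two packed descriptors differ."""
    return bin(a ^ b).count("1")


def nearest_reference(query_bits, reference_bits):
    """Index of the reference descriptor closest to the query."""
    return min(
        range(len(reference_bits)),
        key=lambda i: hamming(query_bits, reference_bits[i]),
    )


if __name__ == "__main__":
    codes, scale = ternarize([0.9, -0.05, -1.1, 0.4])
    print(codes)  # [1, 0, -1, 1]

    query = binarize([0.3, -0.2, 0.7, -0.9])
    database = [0b0101, 0b1011, 0b0000]
    print(nearest_reference(query, database))  # 1
```

In this toy setting a 4-dimensional descriptor occupies 4 bits and matching reduces to an XOR followed by a bit count, which illustrates where memory and latency savings of this kind can come from.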