Visual place recognition (VPR) is a challenging task due to the imbalance between the enormous computational cost and the high recognition performance it demands. Thanks to the efficient feature extraction of lightweight convolutional neural networks (CNNs) and the trainability of the vector of locally aggregated descriptors (VLAD) layer, we propose a lightweight, weakly supervised, end-to-end neural network consisting of a front-end perception model called GhostCNN and a learnable VLAD layer as the back-end. GhostCNN is built on Ghost modules, lightweight CNN-based architectures that generate redundant feature maps with cheap linear operations instead of the traditional convolution process, striking a good trade-off between computational resources and recognition accuracy. To further enhance the proposed lightweight model, we add dilated convolutions to the Ghost module to obtain features containing more spatial semantic information, which improves accuracy. Finally, extensive experiments on a commonly used public benchmark and our private dataset validate that the proposed network reduces the FLOPs and parameters of VGG16-NetVLAD by 99.04% and 80.16%, respectively, while achieving comparable accuracy.
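The computational saving of a Ghost module comes from replacing most of the output channels of a standard convolution with cheap linear (depthwise) operations. A minimal sketch of the multiply-accumulate count, following the standard Ghost-module formulation (the symbols `s` for the ratio of ghost maps and `d` for the cheap-operation kernel size are the usual ones; the concrete sizes below are illustrative, not taken from the paper):

```python
def conv_flops(c_in, c_out, k, h, w):
    # Multiply-accumulates of a standard k x k convolution
    # producing a c_out x h x w output from c_in input channels.
    return c_out * h * w * c_in * k * k

def ghost_flops(c_in, c_out, k, d, s, h, w):
    # A Ghost module first runs a primary convolution that produces
    # only c_out / s intrinsic feature maps, then derives s - 1
    # "ghost" maps from each intrinsic map with a cheap d x d
    # depthwise linear operation.
    m = c_out // s
    primary = conv_flops(c_in, m, k, h, w)
    cheap = (s - 1) * m * h * w * d * d
    return primary + cheap

# Illustrative example: a 3x3, 256 -> 256 channel convolution on a
# 28x28 feature map versus a Ghost module with s = 2 and 3x3 cheap ops.
plain = conv_flops(256, 256, 3, 28, 28)
ghost = ghost_flops(256, 256, 3, 3, 2, 28, 28)
print(f"speed-up ratio: {plain / ghost:.2f}")  # close to s = 2
```

The ratio approaches `s` because the cheap depthwise operations cost far less than the dense channel mixing they replace, which is the trade-off the abstract refers to.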
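The back-end VLAD layer aggregates the CNN's local descriptors into one fixed-length global descriptor by soft-assigning each descriptor to learnable cluster centers and summing the residuals. A minimal NumPy sketch of this NetVLAD-style pooling (the distance-based soft assignment and the `alpha` sharpness parameter are standard; in the trainable layer the assignment is produced by a 1x1 convolution instead):

```python
import numpy as np

def vlad_aggregate(features, centroids, alpha=10.0):
    """Soft-assignment VLAD pooling over local descriptors.

    features:  (N, D) local descriptors from the CNN backbone
    centroids: (K, D) cluster centers (learnable in the real layer)
    Returns a (K * D,) L2-normalized global descriptor.
    """
    # Squared distances from every descriptor to every centroid, (N, K)
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)

    # Numerically stable softmax over negative scaled distances
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)

    # Weighted sum of residuals to each centroid, (K, D)
    residuals = features[:, None, :] - centroids[None, :, :]
    v = (a[:, :, None] * residuals).sum(axis=0)

    # Intra-normalization per cluster, then global L2 normalization
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
desc = vlad_aggregate(rng.normal(size=(100, 8)), rng.normal(size=(4, 8)))
print(desc.shape)  # (32,) -- K * D regardless of the number of inputs
```

Because the output length depends only on `K` and `D`, images with different numbers of local features map to directly comparable descriptors, which is what makes the layer suitable as a place-recognition back-end.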