Sequence-based visual place recognition (sVPR) aims to match a sequence of frames against frames stored in a reference map for localization. Existing methods fall into two groups: sequence matching and sequence descriptor-based retrieval. The former relies on a constant-velocity assumption, which rarely holds in real scenarios, and it does not eliminate the mismatches intrinsic to single-frame descriptors. The latter addresses this problem by extracting one descriptor for the whole sequence, but current sequence descriptors are built only by aggregating features across multiple frames, with no interaction of temporal information. In this paper, we propose a sequence descriptor extraction method that fuses spatiotemporal information effectively and generates discriminative descriptors. Specifically, similar features within the same frame attend to each other and learn spatial structure, while the same local regions in different frames learn how local features change over time. We use sliding windows to control the range of temporal self-attention and adopt relative position encoding to model the positional relationships between different features, which allows our descriptor to capture the inherent dynamics of the frame sequence and the motion of local features.
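The temporal part of the idea above (the same local region attending to itself across nearby frames, restricted by a sliding window and biased by relative position) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `windowed_temporal_attention`, the tensor layout `(T, N, D)`, the use of identity query/key/value projections, and the single scalar bias per frame offset are all assumptions made for illustration. The spatial attention within a frame and the final descriptor aggregation are omitted.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def windowed_temporal_attention(x, window, rel_bias):
    """Sliding-window temporal self-attention with a relative position bias.

    x        : (T, N, D) array, T frames, N local regions, D feature dims.
    window   : frame t may only attend to frames s with |s - t| <= window.
    rel_bias : (2 * window + 1,) array, one bias per offset s - t
               (hypothetical stand-in for a learned relative encoding).

    Returns the attended features (T, N, D) and the weights (N, T, T).
    """
    T, N, D = x.shape
    feats = x.transpose(1, 0, 2)                        # (N, T, D): one sequence per region

    # Scaled dot-product scores between frames, computed per region.
    scores = feats @ feats.transpose(0, 2, 1) / np.sqrt(D)   # (N, T, T)

    # offsets[t, s] = s - t, the relative temporal position of key s to query t.
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]  # (T, T)
    inside = np.abs(offsets) <= window

    # Look up the bias for each offset; out-of-window entries are masked anyway.
    bias = rel_bias[np.clip(offsets + window, 0, 2 * window)]

    # Add the bias inside the window and block everything outside it.
    scores = np.where(inside, scores + bias, -np.inf)

    weights = softmax(scores, axis=-1)                  # rows sum to 1
    out = weights @ feats                               # (N, T, D)
    return out.transpose(1, 0, 2), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4, 8))                      # 5 frames, 4 regions, 8 dims
    out, w = windowed_temporal_attention(x, window=1, rel_bias=np.zeros(3))
    print(out.shape, w.shape)
```

With `window=0` each frame attends only to itself, so the output equals the input. Widening the window lets each region mix in features from more distant frames, which is the range that the sliding window is meant to control.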