Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time, confidence-aware, open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of a Short-Term Memory (STM) and a Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, and cells that are both confident and coherent are promoted to the LTM as persistent semantic landmarks. Designed for deployment on a Jetson Orin-powered unmanned ground vehicle (UGV) alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.
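The STM-to-LTM flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the CrossMaps implementation: the per-observation weight is taken to be the product of the geometric, semantic, and temporal confidences; coherence is approximated by the norm of the weighted mean of unit embeddings; the names `Cell`, `fuse`, `promote`, and `heatmap`, and the thresholds `min_weight` and `min_coherence`, are hypothetical; and toy 4-dimensional vectors stand in for CLIP image and text embeddings.

```python
import numpy as np


class Cell:
    """One STM grid cell: a confidence-weighted running mean of embeddings."""

    def __init__(self, dim):
        self.embedding = np.zeros(dim)
        self.weight = 0.0  # accumulated confidence


def fuse(cell, obs, geo, sem, temp):
    """Fuse one observation into an STM cell.

    Assumption: the three confidence cues are combined by a simple product.
    """
    w = geo * sem * temp
    if w <= 0.0:
        return  # ignore observations with no confidence
    obs = obs / np.linalg.norm(obs)  # work with unit vectors
    total = cell.weight + w
    cell.embedding = (cell.weight * cell.embedding + w * obs) / total
    cell.weight = total


def coherence(cell):
    """Norm of the weighted mean of unit vectors.

    Equals 1 when all fused observations agree and shrinks as they disagree.
    """
    return float(np.linalg.norm(cell.embedding))


def promote(stm, ltm, min_weight=1.5, min_coherence=0.9):
    """Copy confident, coherent STM cells into the LTM (hypothetical thresholds)."""
    for key, cell in stm.items():
        if cell.weight >= min_weight and coherence(cell) >= min_coherence:
            ltm[key] = cell.embedding / np.linalg.norm(cell.embedding)


def heatmap(ltm, text_embedding, shape):
    """Cosine similarity between a query embedding and every LTM landmark."""
    q = text_embedding / np.linalg.norm(text_embedding)
    grid = np.zeros(shape)
    for (row, col), emb in ltm.items():
        grid[row, col] = float(emb @ q)
    return grid


# Toy stand-ins for CLIP embeddings of two concepts.
chair = np.array([1.0, 0.0, 0.0, 0.0])
plant = np.array([0.0, 1.0, 0.0, 0.0])

stm = {(0, 0): Cell(4), (1, 1): Cell(4)}

# Cell (0, 0) sees three consistent, high-confidence "chair" observations.
for _ in range(3):
    fuse(stm[(0, 0)], chair, geo=0.9, sem=0.9, temp=1.0)

# Cell (1, 1) sees conflicting observations, so its coherence stays low.
fuse(stm[(1, 1)], chair, geo=0.9, sem=0.9, temp=1.0)
fuse(stm[(1, 1)], plant, geo=0.9, sem=0.9, temp=1.0)

ltm = {}
promote(stm, ltm)

# Query the LTM with the (toy) text embedding for "chair".
grid = heatmap(ltm, chair, shape=(2, 2))
```

In this toy run only cell `(0, 0)` is promoted: cell `(1, 1)` accumulates enough confidence but fails the coherence check because its two observations disagree. In a real deployment the query vector would come from a CLIP text encoder, and the combination rule and thresholds would need tuning against actual sensor data.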