We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings through model-specific decoders, LatentAM introduces an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into an approximate VLM embedding via an attention mechanism over a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online under trust-region regularization to adapt to evolving scene semantics. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing: optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU, keeping GPU memory usage bounded. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity than state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: https://junwoonlee.github.io/projects/LatentAM
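To make the core idea concrete, the query-to-embedding conversion described above can be sketched as scaled dot-product attention between per-Gaussian query vectors and a learnable key/value dictionary whose values live in the VLM embedding space. This is a minimal, generic sketch; the function and variable names (`queries_to_embeddings`, `dict_keys`, `dict_values`) and all dimensions are illustrative assumptions, not the paper's actual interface, and the paper's dictionary initialization, online updates, and trust-region regularization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def queries_to_embeddings(queries, dict_keys, dict_values):
    """Convert compact per-Gaussian queries into approximate VLM embeddings.

    queries:     (N, d_q) compact query vectors, one per Gaussian primitive
    dict_keys:   (K, d_q) learnable dictionary keys (matched to query dim)
    dict_values: (K, d_v) learnable dictionary values in the VLM embedding space

    Returns (N, d_v): each row is a convex combination of dictionary values,
    weighted by attention between the query and the dictionary keys.
    """
    d_q = queries.shape[1]
    attn = softmax(queries @ dict_keys.T / np.sqrt(d_q), axis=-1)  # (N, K)
    return attn @ dict_values                                      # (N, d_v)

# Toy usage: 5 Gaussians, an 8-atom dictionary, 16-D queries, 512-D embeddings
# (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
N, K, d_q, d_v = 5, 8, 16, 512
q = rng.standard_normal((N, d_q))
keys = rng.standard_normal((K, d_q))
vals = rng.standard_normal((K, d_v))
emb = queries_to_embeddings(q, keys, vals)
print(emb.shape)  # (5, 512)
```

Storing only a small query vector per Gaussian plus one shared dictionary, rather than a full high-dimensional embedding per primitive, is what keeps the map compact and lets a new dictionary be fit to a different VLM's embedding space without retraining a decoder.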