Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that refer to parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multi-modal mixture distribution, in which the mixture weights encode referential ambiguity (e.g., "which table?") and the component covariances encode spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception in a unified belief map. On the VLA-3D benchmark and on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it yields more accurate predictions of the target object's location (~70% more probability mass on the true target than the strongest foundation-model baseline).
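The fusion idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes Gaussian mixture components, a 2D grid belief, and a simple multiplicative Bayesian update that treats the language prediction as independent of perception. The function names, grid size, table positions, and the 0.05 down-weighting factor are all hypothetical.

```python
import numpy as np

def gaussian_mixture_on_grid(gx, gy, weights, means, covs):
    """Evaluate a 2D Gaussian mixture on grid cells and normalize it to sum to 1."""
    pts = np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (N, 2) cell centers
    density = np.zeros(len(pts))
    for w, mu, cov in zip(weights, means, covs):
        cov = np.asarray(cov, dtype=float)
        diff = pts - np.asarray(mu, dtype=float)
        inv = np.linalg.inv(cov)
        norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        maha = np.einsum("ni,ij,nj->n", diff, inv, diff)  # squared Mahalanobis distance
        density += w * norm * np.exp(-0.5 * maha)
    density = density.reshape(gx.shape)
    return density / density.sum()

def fuse(prior, likelihood, eps=1e-12):
    """Cell-wise Bayesian update: posterior is proportional to prior * likelihood."""
    posterior = prior * likelihood + eps
    return posterior / posterior.sum()

# Hypothetical scene: two tables, so "on the table" is referentially ambiguous.
xs = np.linspace(0.0, 5.0, 51)
ys = np.linspace(0.0, 4.0, 41)
gx, gy = np.meshgrid(xs, ys, indexing="xy")

# Language observation: mixture weights model "which table",
# covariances model where on the table the object lies.
language = gaussian_mixture_on_grid(
    gx, gy,
    weights=[0.6, 0.4],
    means=[(1.0, 1.0), (4.0, 3.0)],
    covs=[0.1 * np.eye(2), 0.2 * np.eye(2)],
)

# Perception belief: the robot has already observed the left half of the room
# and did not detect the object, so those cells are down-weighted (assumed factor).
visual = np.ones_like(gx)
visual[gx < 2.5] *= 0.05
visual /= visual.sum()

posterior = fuse(visual, language)

right_half = gx >= 2.5
language_right = language[right_half].sum()
posterior_right = posterior[right_half].sum()
print(f"mass near second table: language only {language_right:.2f}, fused {posterior_right:.2f}")
```

In this toy setup the language observation alone favors the first table, but because perception has already ruled out most of that region, the fused belief shifts toward the second table. A calibrated covariance matters here: an overconfident component would concentrate mass too tightly and make the fused belief brittle.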