Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm, based on the Information Bottleneck principle, that extracts from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a custom warehouse benchmark built with Isaac Sim that contains many visually similar objects and emphasizes contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
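To make the retrieval goal concrete, the following is a minimal sketch of selecting a compact, non-redundant set of memories for a query. It is not STaR's algorithm: it swaps in a simple greedy relevance-minus-redundancy heuristic (in the spirit of maximal marginal relevance) as a stand-in for the Information Bottleneck objective. The function name `select_memories`, the parameters `k` and `beta`, and the toy embedding vectors are all assumptions made for illustration.

```python
import numpy as np

def select_memories(query_vec, memory_vecs, k=2, beta=1.0):
    """Greedily pick up to k memories (hypothetical helper, not from the paper).

    Each candidate is scored by cosine relevance to the query minus
    beta times its highest cosine similarity to an already selected memory.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    relevance = m @ q  # cosine similarity of every memory to the query

    selected = []
    candidates = list(range(len(m)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            # Redundancy is zero while nothing has been selected yet.
            redundancy = max((float(m[i] @ m[j]) for j in selected), default=0.0)
            score = float(relevance[i]) - beta * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (assumed): memories 0 and 1 are near-duplicates.
memories = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
query = np.array([1.0, 0.2])
picked = select_memories(query, memories, k=2, beta=1.0)
```

In this toy setup the second pick skips the near-duplicate of the first and takes the less relevant but non-redundant memory; setting `beta=0` would reduce the procedure to plain top-k relevance ranking.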