Assisting humans in open-world outdoor environments requires robots to translate high-level natural-language intentions into safe, long-horizon, and socially compliant navigation behavior. Existing map-based methods rely on costly pre-built HD maps, while learning-based policies are mostly limited to indoor, short-horizon settings. To bridge this gap, we propose Walk with Me, a map-free framework for long-horizon social navigation from high-level human instructions. Walk with Me leverages GPS context and lightweight candidate points-of-interest from a public map API for semantic destination grounding and waypoint proposal. A High-Level Vision-Language Model (VLM) grounds abstract instructions into concrete destinations and plans coarse waypoint sequences. During execution, an observation-aware routing mechanism determines whether the Low-Level Vision-Language-Action (VLA) policy can handle the current situation or whether explicit safety reasoning from the High-Level VLM is needed. Routine segments are executed by the Low-Level VLA, while complex situations such as crowded crossings trigger high-level reasoning and stop-and-wait behavior when proceeding is unsafe. By combining semantic intent grounding, map-free long-horizon planning, safety-aware reasoning, and low-level action generation, Walk with Me enables practical outdoor social navigation for human-centric assistance.
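The observation-aware routing mechanism can be illustrated with a minimal sketch. All names here (`Observation`, `is_complex`, the density threshold, and the returned action labels) are illustrative assumptions for exposition, not the framework's actual API: a complexity check decides whether the Low-Level VLA acts directly or the High-Level VLM performs explicit safety reasoning, possibly issuing a stop-and-wait.

```python
# Hypothetical sketch of the observation-aware routing loop; names and
# thresholds are assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Observation:
    crowd_density: float  # e.g. detected pedestrians per square meter
    at_crossing: bool     # whether the robot currently faces a road crossing


def is_complex(obs: Observation, density_threshold: float = 0.5) -> bool:
    """Route to high-level reasoning when the scene is crowded or risky."""
    return obs.at_crossing or obs.crowd_density > density_threshold


def step(obs: Observation) -> str:
    if is_complex(obs):
        # High-Level VLM performs explicit safety reasoning;
        # stop and wait when the situation is deemed unsafe.
        return "stop_and_wait" if obs.crowd_density > 1.0 else "high_level_plan"
    # Routine segment: the Low-Level VLA policy executes directly.
    return "low_level_action"
```

In this sketch, routing is a cheap scalar check so the expensive VLM call is only triggered on complex segments; the actual framework conditions this decision on richer visual observations.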