From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electric wheelchairs. Unlike on-road autonomous driving, it requires precise maneuvering around pedestrians and over unpredictable sidewalk terrain, with a perception stack as lightweight as a single monocular RGB camera. Imitation learning (IL) from demonstrations offers a practical solution, but the resulting autopilot policy often suffers from compounding errors, poor social compliance on sidewalks, and weak counterfactual reasoning in complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation using only a monocular RGB camera. We first propose anchored flow matching as the action representation for pre-training the policy on large-scale robot fleet data, so that it captures the diverse, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we then design a human-in-the-loop preference learning scheme that fine-tunes the policy on a small amount of human intervention data, strengthening its counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves a 42% success rate and 66% route completion in simulation, while the preference-tuned FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.
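To make the idea of flow matching as an action representation concrete, the following is a minimal sketch of plain conditional flow matching with a straight-line probability path. It is not the anchored variant proposed here, whose details the abstract does not give, and it omits the camera observation that a real policy would condition on. The names `flow_matching_loss`, `sample_actions`, and `oracle_velocity`, the two-dimensional action, and the single target action `goal` are all hypothetical choices made for illustration.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, rng):
    """Plain conditional flow-matching loss on a batch of expert actions."""
    noise = rng.standard_normal(actions.shape)              # x0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1))   # one time per sample
    x_t = (1.0 - t) * noise + t * actions                   # point on the straight path
    target = actions - noise                                # velocity of that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


def sample_actions(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x = noise.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x


# Toy setup (assumption): every demonstration is the same 2-D action.
goal = np.array([1.0, 0.5])


def oracle_velocity(x, t):
    """Exact velocity field when the data distribution is the single point `goal`."""
    return (goal - x) / (1.0 - t)


rng = np.random.default_rng(0)
actions = np.tile(goal, (64, 1))

# The oracle field matches the regression target, so the loss is numerically zero.
loss = flow_matching_loss(oracle_velocity, actions, rng)

# Sampling transports Gaussian noise onto the demonstrated action.
samples = sample_actions(oracle_velocity, rng.standard_normal((8, 2)), steps=10)
```

In a learned policy, `oracle_velocity` would be replaced by a neural network trained to minimize `flow_matching_loss`. With multimodal demonstrations, different noise draws would then flow to different action modes, which is the property the abstract relies on.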
