We present a novel approach for fast and reliable policy selection for navigation in partial maps. Our robot plans with the recent learning-augmented, model-based Learning over Subgoals Planning (LSP) abstraction and reuses data collected during navigation to evaluate how well alternative policies would have performed, via a procedure we call offline alt-policy replay. Costs from offline alt-policy replay constrain policy selection among the LSP-based policies during deployment, improving convergence speed, cumulative regret, and average navigation cost. With only limited prior knowledge about the nature of unseen environments, we achieve improvements in cumulative regret of at least 67% and as much as 96% over the baseline bandit approach in our experiments in simulated maze and office-like environments.
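The abstract does not spell out the selection rule, so the following is only a rough sketch of how replay costs might constrain a bandit over policies. It assumes a UCB-style rule for cost minimization in which each policy's optimistic cost estimate is clipped from below by a cost obtained from replay. The function name `select_policy`, its parameters, and the exploration constant `c` are all hypothetical and are not taken from the paper.

```python
import math


def select_policy(mean_cost, pulls, replay_lower_bound, trial, c=1.0):
    """Return the index of the policy with the lowest optimistic cost.

    Hypothetical sketch, not the paper's algorithm.
    mean_cost[i]          : average observed cost of policy i (ignored if never tried)
    pulls[i]              : number of trials in which policy i was deployed
    replay_lower_bound[i] : assumed lower bound on policy i's cost from offline
                            replay (-inf if no replay data is available)
    trial                 : 1-based index of the current trial
    """
    best, best_score = None, math.inf
    for i in range(len(mean_cost)):
        if pulls[i] == 0:
            # Never deployed: the bandit alone would be fully optimistic.
            optimistic = -math.inf
        else:
            # Standard confidence-bound bonus, subtracted because we minimize cost.
            optimistic = mean_cost[i] - c * math.sqrt(math.log(trial) / pulls[i])
        # Replay costs rule out estimates that are too optimistic.
        score = max(optimistic, replay_lower_bound[i])
        if score < best_score:
            best, best_score = i, score
    return best
```

Under these assumptions, a policy that has never been deployed is no longer tried automatically: if replay suggests it would have cost more than the current best, the selector can skip it, which is one plausible way such a constraint could speed up convergence.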