Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state-estimation and re-planning step can be performed. This raises a question: \emph{when}, along a horizon, should a robot spend its limited update budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a \emph{skip-update} scheme: at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy; between updates, it reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows over skip intervals as a function of the properties of the TVMDP and the skip lengths. The resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. In these experiments, adaptive allocation outperforms the other budgeted baselines.
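To make the skip-update scheme concrete, the following is a minimal Python sketch under stated assumptions, not the implementation evaluated in this work. It shows the two computations performed at an update time (a count-based maximum-likelihood estimate of the transition kernel and backward induction for a finite-horizon policy) together with a placeholder budgeted trigger. All function names, the uniform fallback for unvisited state-action pairs, and the accumulated-drift threshold used as a stand-in for the regret bound are hypothetical; the actual regret-guided rule is not specified here.

```python
import numpy as np


def mle_transition_kernel(transitions, n_states, n_actions):
    """Count-based maximum-likelihood estimate P[s, a, s'] from (s, a, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Assumption: unvisited (s, a) pairs fall back to a uniform distribution.
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)


def finite_horizon_policy(P, R, horizon):
    """Backward induction on kernel P[s, a, s'] and reward R[s, a]; returns policy[t, s]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        Q = R + P @ V  # (S, A, S) @ (S,) -> (S, A)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def schedule_updates(drift_per_step, budget, threshold):
    """Hypothetical trigger: update once the drift accumulated since the last
    update reaches `threshold`, as long as budget remains.

    The accumulated drift is only a proxy for a regret bound (an assumption).
    """
    update_times = [0]  # spend one update at the start of the horizon
    remaining = budget - 1
    accumulated = 0.0
    for t in range(1, len(drift_per_step)):
        accumulated += drift_per_step[t]
        if remaining > 0 and accumulated >= threshold:
            update_times.append(t)
            remaining -= 1
            accumulated = 0.0
    return update_times
```

With a constant drift rate this trigger reduces to evenly spaced updates; it only becomes adaptive when the per-step drift values differ, which is the situation the regret-guided rule is meant to exploit.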