Delayed Markov decision processes satisfy the Markov property by augmenting the agent's state space with a finite time window of recently committed actions. Relying on these state augmentations, delay-resolved reinforcement learning algorithms train policies to interact optimally with environments that exhibit observation or action delays. Although such methods can be trained directly on real robots, sample inefficiency, limited resources, or safety constraints make it common to instead transfer models trained in simulation to the physical robot. However, robotic simulations rely on approximate models of the physical system, which hinders sim2real transfer. In this work, we treat the various uncertainties in modelling the robot's dynamics as unknown intrinsic disturbances applied to the system input. We introduce a disturbance-augmented Markov decision process in delayed settings as a novel representation for incorporating disturbance estimation into the training of on-policy reinforcement learning algorithms. The proposed method is validated across several metrics on learning a robotic reaching task and compared with disturbance-unaware baselines. The results show that the disturbance-augmented models can achieve greater stability and robustness in the control response, which in turn improves the prospects of successful sim2real transfer.
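The state augmentation described in the abstract can be illustrated with a minimal sketch. It assumes a fixed, known delay of `delay` steps and flat list-valued observations and actions. The class name `AugmentedState`, its methods, and the optional `disturbance_estimate` argument are hypothetical and chosen for illustration only. The abstract does not specify how the disturbance is estimated, so the estimate is simply passed in from outside.

```python
from collections import deque
from typing import Deque, List, Sequence


class AugmentedState:
    """Hypothetical helper: concatenates the latest observation, the last
    `delay` committed actions, and an optional disturbance estimate."""

    def __init__(self, delay: int, action_dim: int) -> None:
        if delay < 1:
            raise ValueError("delay must be at least 1")
        self.action_dim = action_dim
        # Fixed-length window, oldest action first; initialised with zero actions.
        self.window: Deque[List[float]] = deque(
            ([0.0] * action_dim for _ in range(delay)), maxlen=delay
        )

    def push_action(self, action: Sequence[float]) -> None:
        """Record a newly committed action; the oldest one drops out."""
        if len(action) != self.action_dim:
            raise ValueError("unexpected action dimension")
        self.window.append(list(action))

    def build(
        self,
        observation: Sequence[float],
        disturbance_estimate: Sequence[float] = (),
    ) -> List[float]:
        """Return observation + flattened action window + disturbance estimate."""
        flat_actions = [a for action in self.window for a in action]
        return list(observation) + flat_actions + list(disturbance_estimate)


# Usage: a two-step delay with one-dimensional actions.
aug = AugmentedState(delay=2, action_dim=1)
aug.push_action([0.5])
state = aug.build(observation=[1.0, 2.0], disturbance_estimate=[0.1])
print(state)  # [1.0, 2.0, 0.0, 0.5, 0.1]
```

With an empty `disturbance_estimate`, the result corresponds to the action-window augmentation alone. Appending the estimate is one plausible reading of "disturbance-augmented", not a confirmed description of the authors' representation.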