Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: first, the state x(t) is observed at irregular time intervals, and second, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite control, where the communication link between Earth and a satellite causes irregular observations and delays. Existing offline RL algorithms have achieved success in environments with either irregularly observed states or known delays. However, environments involving both irregular observations in time and unknown delays remain an open and challenging problem. To this end, we propose Neural Laplace Control, a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner, and that can learn from an offline dataset sampled at irregular time intervals from an environment with an inherent, unknown, constant delay. We show experimentally that, on continuous-time delayed environments, it achieves near-expert policy performance.
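To make the two features concrete, the following is a minimal sketch of a delayed continuous-time environment observed at irregular times. It is not the paper's benchmark or method: the toy dynamics dx/dt = -x(t) + a(t - g), the Euler integrator, and the names `simulate_delayed_system` and `observe_irregularly` are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_delayed_system(action_fn, delay, t_end=5.0, dt=1e-3, x0=0.0):
    """Euler-integrate the toy system dx/dt = -x(t) + a(t - delay).

    The dynamics are an illustrative assumption, not taken from the paper.
    Actions that would have been issued before t = 0 are treated as zero.
    """
    n_steps = int(round(t_end / dt))
    ts = np.arange(n_steps + 1) * dt
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    for k in range(n_steps):
        t_delayed = ts[k] - delay  # the action taking effect now was issued `delay` ago
        a = action_fn(t_delayed) if t_delayed >= 0.0 else 0.0
        xs[k + 1] = xs[k] + dt * (-xs[k] + a)
    return ts, xs


def observe_irregularly(ts, xs, n_obs, rng):
    """Draw n_obs sorted, non-uniformly spaced times and interpolate the state there."""
    t_obs = np.sort(rng.uniform(ts[0], ts[-1], size=n_obs))
    x_obs = np.interp(t_obs, ts, xs)
    return t_obs, x_obs


rng = np.random.default_rng(0)
# Constant unit action from t = 0, with a (here known) delay g = 1.0.
ts, xs = simulate_delayed_system(lambda t: 1.0, delay=1.0)
t_obs, x_obs = observe_irregularly(ts, xs, n_obs=20, rng=rng)
```

In this sketch the state stays at its initial value until roughly t = g, even though the action is applied from t = 0. An offline dataset would contain only the pairs (`t_obs`, `x_obs`) together with the logged actions, so a learner would have to infer the delay from the data.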