Reinforcement learning algorithms typically consider discrete-time dynamics, even though the underlying systems are often continuous in time. In this paper, we introduce a model-based reinforcement learning algorithm that represents continuous-time dynamics using nonlinear ordinary differential equations (ODEs). We capture epistemic uncertainty using well-calibrated probabilistic models and use the optimistic principle for exploration. Our regret bounds highlight the importance of the measurement selection strategy (MSS), since in continuous time we must decide not only how to explore but also when to observe the underlying system. Our analysis demonstrates that the regret is sublinear when the ODEs are modeled with Gaussian processes (GPs) for common choices of MSS, such as equidistant sampling. Additionally, we propose an adaptive, data-dependent, practical MSS that, when combined with GP dynamics, also achieves sublinear regret with significantly fewer samples. On several applications, we showcase the benefits of continuous-time modeling over its discrete-time counterpart, as well as the benefits of our proposed adaptive MSS over standard baselines.
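To make the role of the MSS concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes a hypothetical damped-pendulum ODE, a fixed zero-control policy, and a hand-written RK4 integrator standing in for the true continuous-time system. The helper names (`pendulum`, `rk4_step`, `equidistant_mss`, `observe`) are illustrative assumptions. The sketch only shows how an equidistant MSS fixes the times at which the continuous trajectory is observed.

```python
import math


def pendulum(x, u):
    """Hypothetical continuous-time dynamics dx/dt = f(x, u): a damped pendulum."""
    theta, omega = x
    return (omega, -9.81 * math.sin(theta) - 0.1 * omega + u)


def rk4_step(f, x, u, dt):
    """One classical Runge-Kutta step with the control u held constant over dt."""
    k1 = f(x, u)
    k2 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)), u)
    k3 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)), u)
    k4 = f(tuple(xi + dt * ki for xi, ki in zip(x, k3)), u)
    return tuple(
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    )


def equidistant_mss(horizon, num_measurements):
    """Equidistant MSS: evenly spaced measurement times in (0, horizon]."""
    step = horizon / num_measurements
    return [step * (i + 1) for i in range(num_measurements)]


def observe(f, x0, policy, times, substeps=20):
    """Integrate the ODE and record (time, state) only at the measurement times."""
    observations = []
    t, x = 0.0, x0
    for t_next in times:
        dt = (t_next - t) / substeps
        for _ in range(substeps):
            x = rk4_step(f, x, policy(x), dt)
        t = t_next
        observations.append((t, x))
    return observations


# Four equidistant measurements over a 2-second episode with zero control.
times = equidistant_mss(horizon=2.0, num_measurements=4)
data = observe(pendulum, (0.5, 0.0), lambda x: 0.0, times)
```

An adaptive, data-dependent MSS would replace `equidistant_mss` with a rule that chooses the measurement times from the current model's uncertainty. That rule is specific to the paper and is not reproduced here.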