Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better and is more sample-efficient than prior methods, outperforming baselines across several deep RL tasks.
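For concreteness, the greedy objective described above can be sketched as follows; the notation here is illustrative rather than taken from the abstract: $\pi$ denotes the policy, $T$ the episode horizon, $\lambda \ge 0$ the reward-uncertainty trade-off weight, and $\mu_n, \sigma_n$ the posterior mean and epistemic standard deviation of the learned ODE model after $n$ episodes.
\[
\pi_{n} \in \arg\max_{\pi} \; \mathbb{E}\!\left[ \int_{0}^{T} \Big( r\big(x(t), \pi(x(t))\big) \;+\; \lambda \,\big\| \sigma_{n}\big(x(t), \pi(x(t))\big) \big\| \Big)\, dt \right],
\qquad \dot{x}(t) = \mu_{n}\big(x(t), \pi(x(t))\big).
\]
Under this schematic reading, setting $\lambda = 0$ recovers purely reward-driven planning, while dropping the reward term corresponds to the unsupervised (reward-free) setting in which only epistemic uncertainty drives exploration.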