We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. Unlike in a discrete-time MDP, after an action is taken a continuous-time process moves to a state and remains there for a random holding time. Assuming that both the transition probabilities and the rates of the exponential holding times are unknown, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. We also design a learning algorithm and establish a finite-time regret bound that achieves this logarithmic growth rate. Our analysis builds on upper confidence reinforcement learning, a careful estimation of the mean holding times, and stochastic comparison of point processes.
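To make the setting concrete, the following is a minimal sketch of a continuous-time MDP with exponential holding times, together with the plain empirical estimate of the mean holding time for each state-action pair. It is an illustration only and not the algorithm analyzed in the paper: the function names, the data layout (`P`, `rates`, `policy`), and the toy numbers are all assumptions made for this example, and the sketch simulates a fixed policy rather than a learning rule.

```python
import random


def simulate_ctmdp(P, rates, policy, horizon, start=0, seed=0):
    """Simulate a toy continuous-time MDP under a fixed policy.

    All argument names are hypothetical, chosen for this sketch:
      P[s][a]     list of next-state probabilities for the pair (s, a)
      rates[s][a] rate of the exponential holding time for (s, a)
      policy[s]   action taken whenever the process is in state s
    Returns a dict mapping (s, a) to the list of sampled holding times.
    The final holding time may run past `horizon`; it is still recorded.
    """
    rng = random.Random(seed)
    t, s = 0.0, start
    holding = {}
    while t < horizon:
        a = policy[s]
        # Holding time in state s under action a is Exp(rates[s][a]).
        tau = rng.expovariate(rates[s][a])
        holding.setdefault((s, a), []).append(tau)
        t += tau
        # Jump to the next state according to the transition probabilities.
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
    return holding


def mean_holding_estimates(holding):
    """Empirical mean holding time for each visited (state, action) pair."""
    return {sa: sum(v) / len(v) for sa, v in holding.items()}


# Toy two-state, one-action model (numbers are illustrative assumptions).
P = {0: {0: [0.3, 0.7]}, 1: {0: [0.6, 0.4]}}
rates = {0: {0: 2.0}, 1: {0: 0.5}}  # true mean holding times: 0.5 and 2.0
policy = {0: 0, 1: 0}

samples = simulate_ctmdp(P, rates, policy, horizon=5000.0)
estimates = mean_holding_estimates(samples)
```

In this sketch the learner would observe only the sampled holding times and transitions, not `P` or `rates`. A confidence-based method would attach a confidence radius to each empirical mean, which is omitted here.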