We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, in a continuous-time process the system moves to a new state after an action is taken and stays there for a random holding time. With unknown transition probabilities and unknown rates of the exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret upper bound that matches this logarithmic growth rate. Our analysis builds on upper confidence reinforcement learning, a careful estimation of the mean holding times, and stochastic comparison of point processes.
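The continuous-time dynamics described above (take an action, hold in the current state for an exponentially distributed time, then jump according to the transition probabilities, accruing reward at a per-unit-time rate) can be sketched as a minimal simulator. This is an illustrative assumption-laden sketch of the *environment* only, not the paper's learning algorithm; the function name, the dictionary-based interface, and the stationary deterministic policy are all hypothetical choices.

```python
import random

def simulate_ctmdp(P, rates, rewards, policy, s0, horizon):
    """Simulate one trajectory of a continuous-time MDP up to time `horizon`.

    P[s][a]       : dict mapping next state -> transition probability
    rates[s][a]   : rate of the exponential holding time in state s under action a
    rewards[s][a] : reward accrued per unit time while holding in s under a
    policy[s]     : action chosen in state s (stationary deterministic policy)

    Returns the empirical long-run average reward over [0, horizon].
    """
    t, s, total_reward = 0.0, s0, 0.0
    while t < horizon:
        a = policy[s]
        # Holding time is exponential with rate rates[s][a]
        # (unknown to the learner in the paper's setting).
        hold = random.expovariate(rates[s][a])
        hold = min(hold, horizon - t)         # truncate at the horizon
        total_reward += rewards[s][a] * hold  # reward accrues continuously in time
        t += hold
        # Jump to the next state according to the transition probabilities.
        states, probs = zip(*P[s][a].items())
        s = random.choices(states, weights=probs)[0]
    return total_reward / horizon
```

A learner in this setting observes such trajectories and must estimate both the transition probabilities `P` and the holding-time rates `rates` from data, which is where the mean-holding-time estimation mentioned above enters.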