There are many algorithms for regret minimisation in episodic reinforcement learning. The problem is well understood theoretically, provided that the sequence of states, actions and rewards from each episode is available to the algorithm updating the policy immediately after every interaction with the environment. In practice, however, feedback is almost always delayed. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first updates the policy as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, the number of actions, the episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.
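The difference between the two approaches to delayed feedback can be illustrated with a small scheduling sketch. It is not the algorithm analysed in the paper: the function name `update_schedule`, the arrival convention (feedback from episode `k` with delay `d` is usable from episode `k + d + 1` onwards) and the `batch_size` threshold used as the waiting rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List


def update_schedule(delays: List[int], batch_size: int = 1) -> List[int]:
    """Return the episodes at whose start the policy would be updated.

    Assumed convention (hypothetical): feedback from episode k with delay
    delays[k] arrives at time k + delays[k] and can be used by any episode
    that starts strictly after that time.

    batch_size = 1 mimics "update as soon as new information is available";
    batch_size > 1 is a stand-in for "wait before updating".
    """
    arrivals = sorted(k + d for k, d in enumerate(delays))
    updates = []
    used = 0      # feedback items already incorporated into the policy
    arrived = 0   # feedback items that have arrived so far
    for t in range(len(delays)):
        # Count feedback that arrived strictly before episode t starts.
        while arrived < len(arrivals) and arrivals[arrived] < t:
            arrived += 1
        # Update only when enough unused feedback has accumulated.
        if arrived - used >= batch_size:
            updates.append(t)
            used = arrived
    return updates


delays = [0, 2, 0, 0, 1]
print(update_schedule(delays, batch_size=1))  # [1, 3, 4]
print(update_schedule(delays, batch_size=2))  # [3, 4]
```

With the same delays, the waiting variant performs fewer policy updates, and each update incorporates more feedback at once.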