The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state-augmentation baseline, in which the state is concatenated with the last $m$ committed actions, suffers from complexity that is exponential in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state space are sufficient for attaining maximal reward, but need to be non-stationary. Stationary Markov policies, in contrast, are sub-optimal in general. Consequently, we devise a non-stationary, Q-learning style, model-based algorithm that solves delayed-execution tasks without resorting to state augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state augmentation struggle or fail due to divergence. The code is available at github.com/galdl/rl_delay_basic and github.com/galdl/rl_delay_atari.
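To make the delayed-execution setting concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `toy_step` transition function on an integer chain, a hypothetical `DelayedExecution` wrapper that queues committed actions, and a fixed list of initial pending actions; none of these names come from the released code. The helper `augmented_state_count` illustrates why concatenating the state with the last $m$ committed actions grows exponentially in $m$.

```python
from collections import deque


def augmented_state_count(num_states, num_actions, delay):
    """Number of augmented states: one original state plus the last `delay` committed actions."""
    return num_states * num_actions ** delay


def toy_step(state, action):
    """Hypothetical deterministic chain: action is -1 or +1, reward 1.0 on reaching state 3."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


class DelayedExecution:
    """Illustrative wrapper: an action committed now is executed `delay` steps later."""

    def __init__(self, step_fn, initial_state, delay, initial_actions):
        # Assumption: the first `delay` executed actions are supplied up front.
        if len(initial_actions) != delay:
            raise ValueError("need exactly `delay` initial pending actions")
        self.step_fn = step_fn
        self.state = initial_state
        self.pending = deque(initial_actions)

    def step(self, committed_action):
        # Commit the new action, then execute the oldest pending one.
        self.pending.append(committed_action)
        executed_action = self.pending.popleft()
        self.state, reward = self.step_fn(self.state, executed_action)
        return self.state, reward, executed_action


if __name__ == "__main__":
    # Augmented state space size for 10 states and 4 actions as the delay grows.
    for m in range(5):
        print(m, augmented_state_count(10, 4, m))

    env = DelayedExecution(toy_step, initial_state=0, delay=2, initial_actions=[1, 1])
    for action in (-1, -1, 1):
        print(env.step(action))
```

In this sketch the decision-maker observes only the current state while the queue of pending actions stays inside the wrapper, which mirrors the setting where a policy over the original state space must account for actions that have been committed but not yet executed.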