Most reinforcement learning algorithms treat the context under which they operate as a stationary, isolated and undisturbed environment. However, in the real world, the environment is constantly changing due to a variety of external influences. To address this problem, we study Markov Decision Processes (MDP) under the influence of an external temporal process. We formalize this notion and discuss conditions under which the problem becomes tractable with suitable solutions. We propose a policy iteration algorithm to solve this problem and theoretically analyze its performance.
翻译:大多数强化学习算法将其运行环境视为平稳、孤立且不受干扰的静态环境。然而在现实世界中,环境会因多种外部影响而持续变化。针对这一问题,我们研究了受外部时间过程影响的马尔可夫决策过程(MDP)。我们形式化定义了该概念,并讨论了问题在何种条件下可通过适当方法得以解决。我们提出了一种策略迭代算法来处理这一问题,并从理论上分析了其性能。