Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous space, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
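The additive decomposition described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm: it assumes the exogenous and endogenous state variables are already known (the paper's contribution is discovering them), and the toy dynamics, the reward shapes, the function name `rollout_returns`, and the `gain` parameter are all hypothetical choices made for illustration. It compares two linear policies under identical noise and checks that removing the exogenous reward leaves the gap between the policies unchanged while lowering the variance of the return.

```python
import random
import statistics


def rollout_returns(gain, n_episodes=500, horizon=50, seed=0):
    """Simulate a toy MDP and return (full_returns, endo_returns) per episode.

    Hypothetical setup, assumed for illustration only:
      x is an exogenous AR(1) state that the action cannot influence,
      e is an endogenous state driven by the action a = -gain * e,
      reward = r_exo(x) + r_end(e, a), i.e. it decomposes additively.
    """
    rng = random.Random(seed)  # same seed gives the same noise for every policy
    full_returns, endo_returns = [], []
    for _episode in range(n_episodes):
        x, e = 0.0, 1.0
        full_ret = 0.0
        endo_ret = 0.0
        for _step in range(horizon):
            a = -gain * e
            r_exo = x                           # depends only on the exogenous state
            r_end = -(e ** 2) - 0.1 * a ** 2    # depends only on endogenous state and action
            full_ret += r_exo + r_end
            endo_ret += r_end
            x = 0.9 * x + rng.gauss(0.0, 1.0)        # exogenous transition ignores a
            e = e + a + 0.05 * rng.gauss(0.0, 1.0)   # endogenous transition
        full_returns.append(full_ret)
        endo_returns.append(endo_ret)
    return full_returns, endo_returns


if __name__ == "__main__":
    full_a, endo_a = rollout_returns(gain=0.2)
    full_b, endo_b = rollout_returns(gain=0.8)
    print("variance of full return:      ", statistics.pvariance(full_a))
    print("variance of endogenous return:", statistics.pvariance(endo_a))
    print("policy gap, full reward:      ", statistics.mean(full_b) - statistics.mean(full_a))
    print("policy gap, endogenous reward:", statistics.mean(endo_b) - statistics.mean(endo_a))
```

Because the exogenous reward contributes the same amount to every policy's return in this toy setup, it shifts all returns equally and only adds noise. A learner that sees the endogenous reward alone therefore ranks the two policies the same way from a much less noisy signal.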