Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the Markov decision process (MDP) can be decomposed into an exogenous Markov reward process (based on the exogenous reward) and an endogenous MDP (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous subspaces is not given but must be discovered. The paper introduces algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination, and proves their correctness. These algorithms can be applied during reinforcement learning to discover the exogenous subspace, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
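To make the decomposition concrete, here is a hedged sketch in assumed notation (e_t for the endogenous state, x_t for the exogenous state; these symbols are illustrative, not necessarily the paper's): the exogenous state evolves independently of actions and of the endogenous state, the reward splits additively, and maximizing the endogenous return alone recovers an optimal policy for the full reward.

```latex
% Hedged sketch of the additive decomposition; all symbols are assumed
% notation for illustration, not necessarily the paper's.
s_t = (e_t, x_t), \qquad
P(x_{t+1} \mid e_t, x_t, a_t) = P(x_{t+1} \mid x_t)
\quad \text{(exogenous dynamics ignore } e_t \text{ and } a_t\text{)}

R(s_t, a_t) = R_{\text{end}}(e_t, a_t) + R_{\text{exo}}(x_t)

\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} R_{\text{end}}(e_t, a_t)\Big]
\;\Rightarrow\; \pi^{*} \text{ is also optimal for the full reward } R.
```

The key observation behind the implication is that the exogenous return is identical for every policy (actions cannot influence x_t), so it shifts all policy values by the same constant and can be dropped without changing which policies are optimal.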
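Where the abstract mentions discovering the subspaces when they are mixed through linear combination, the following is a minimal illustrative sketch, not the paper's algorithm, under an assumed linear dynamics model s' = A s + B a + noise: state directions that the action cannot influence are candidates for the exogenous subspace and can be read off the estimated B. Note that the paper's full exogeneity criterion also requires independence from the endogenous state; this sketch checks only action-independence, and all function names are hypothetical.

```python
# Illustrative sketch (assumed linear dynamics s' = A s + B a + noise):
# directions w with w^T B = 0 evolve independently of the action, a
# necessary condition for exogeneity in this setting.
import numpy as np

def fit_linear_dynamics(S, A_ctrl, S_next):
    """Least-squares fit of s' ~ A s + B a from transition data.
    S: (T, d) states, A_ctrl: (T, k) actions, S_next: (T, d) next states."""
    X = np.hstack([S, A_ctrl])                        # (T, d + k)
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)    # (d + k, d)
    d = S.shape[1]
    A_hat, B_hat = W[:d].T, W[d:].T                   # s' ~ A_hat s + B_hat a
    return A_hat, B_hat

def exogenous_subspace(B_hat, tol=1e-6):
    """Candidate exogenous directions: the left null space of B_hat,
    i.e. directions of the state the action cannot influence."""
    U, sing, _ = np.linalg.svd(B_hat)
    rank = int(np.sum(sing > tol))
    return U[:, rank:]    # columns span {w : w^T B_hat ~ 0}

# Tiny synthetic check: two exogenous dimensions, one endogenous.
rng = np.random.default_rng(0)
T, d, k = 5000, 3, 1
A_true = np.array([[0.9, 0.0, 0.0],
                   [0.0, 0.8, 0.0],
                   [0.1, 0.0, 0.7]])   # endogenous dim may depend on exogenous
B_true = np.array([[0.0], [0.0], [1.0]])  # action only moves dimension 2
S = rng.normal(size=(T, d))
Acts = rng.normal(size=(T, k))
S_next = S @ A_true.T + Acts @ B_true.T + 0.01 * rng.normal(size=(T, d))
A_hat, B_hat = fit_linear_dynamics(S, Acts, S_next)
W_exo = exogenous_subspace(B_hat, tol=0.05)
print(W_exo.round(2))   # approximately spans the first two coordinates
```

In this toy setup the recovered subspace matches the two action-independent coordinates; an online version would refit the dynamics model as transitions accumulate and project the reward onto the endogenous complement, in the spirit the abstract describes.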