In real-world applications of reinforcement learning, it is often challenging to obtain, without prior knowledge, a state representation that is both parsimonious and Markov. Consequently, it is common practice to construct a state that is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state can slow learning and obscure the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP): the smallest subvector of the original state under which the process remains an MDP and shares the same optimal policy as the original process. We propose a novel sequential knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional, complex, nonlinear dynamics. In large samples, the proposed method controls the false discovery rate and selects all sufficient variables with probability approaching one. Because the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy optimization. Empirical experiments support the theoretical results and show that the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret.
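To make the motivating practice concrete, the following is a minimal sketch of building an enlarged state by concatenating measurements over contiguous time points. It is illustrative only and is not part of the SEEK algorithm; the function name `stack_observations`, the window length `k`, and the toy data are assumptions introduced here.

```python
def stack_observations(obs, k):
    """Concatenate k contiguous measurements into one state vector per time step.

    obs: list of T measurement vectors, each a list of d numbers.
    Returns T - k + 1 state vectors, each of length k * d; the state at
    index t holds obs[t], obs[t + 1], ..., obs[t + k - 1] laid end to end.
    """
    T = len(obs)
    if not 1 <= k <= T:
        raise ValueError("k must be between 1 and the number of time points")
    return [
        [x for step in obs[t:t + k] for x in step]
        for t in range(T - k + 1)
    ]


# Hypothetical toy data: T = 5 time points, d = 2 measurements per time point.
obs = [[t, 10 * t] for t in range(5)]
states = stack_observations(obs, k=3)

print(len(states), len(states[0]))  # 3 state vectors, each of dimension 6
print(states[0])                    # [0, 0, 1, 10, 2, 20]
```

The state dimension grows from d to k * d, whether or not the extra lags carry information relevant to the optimal policy. Removing such unnecessary coordinates is what selecting a minimal sufficient state is meant to achieve.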