In real-world applications of reinforcement learning, it is often challenging to obtain, without prior knowledge, a state representation that is parsimonious and satisfies the Markov property. Consequently, it is common practice to construct a state that is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state may slow learning and obscure the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP): the smallest subvector of the original state under which the process remains an MDP and retains the same reward function as the original process. We propose a novel SEquEntial Knockoffs (SEEK) algorithm that estimates the minimal sufficient state in systems with high-dimensional, complex, nonlinear dynamics. In large samples, the proposed method achieves selection consistency. Because the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy learning. Empirical experiments corroborate the theoretical results and show that the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret.
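To fix ideas, one way to formalize the definition is the following minimal sketch, in our own notation (the paper's formal statement may differ in details). Let \((S_t, A_t, R_t)_{t \ge 0}\) denote the observed process with state \(S_t \in \mathbb{R}^p\), and write \(S_t^G\) for the subvector indexed by \(G \subseteq \{1, \dots, p\}\). The set \(G\) yields a sufficient state if
\[
\Pr\big(S_{t+1}^G \in \cdot \mid S_t^G, A_t, \{S_j, A_j\}_{j < t}\big) = \Pr\big(S_{t+1}^G \in \cdot \mid S_t^G, A_t\big)
\quad \text{and} \quad
\mathbb{E}\big[R_t \mid S_t, A_t\big] = \mathbb{E}\big[R_t \mid S_t^G, A_t\big],
\]
i.e., \((S_t^G, A_t, R_t)_{t \ge 0}\) is itself an MDP with the original reward function; the minimal sufficient state corresponds to the smallest such \(G\). For context, in the standard knockoff filter on which knockoff-based selection rests, each candidate coordinate \(j\) receives an importance statistic \(W_j\) comparing it with a synthetic knockoff copy, and the selected set is
\[
\widehat{G} = \{\, j : W_j \ge \tau \,\}, \qquad
\tau = \min\Big\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\max\big(1, \#\{j : W_j \ge t\}\big)} \le q \Big\},
\]
which controls the false discovery rate at a prespecified level \(q\); SEEK applies this type of selection sequentially.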