We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. This reformulation admits an interpretation as a policy optimization task in a newly constructed reversed MDP. We further show that the global KL constraint can be decomposed exactly into a family of product-box subproblems, which are solved by projected policy gradient and combined through an outer budget search. Beyond policy testing, our reformulation and reversed MDP view suggest extensions to other pure exploration tasks in MDPs, including policy evaluation and best policy identification.
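To make the decision being tested concrete, the following is a minimal sketch of the underlying question when the MDP is fully known, which is not the setting studied here (the paper assumes the model must be learned from samples drawn through a generative model). It computes the exact discounted value of a fixed policy by solving the Bellman linear system and compares it to a threshold. The function names, the array layout, and the use of an initial-state distribution `rho` to reduce the value function to a scalar are all assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact value function of a fixed policy in a known discounted MDP.

    P:  (S, A, S) array, P[s, a, t] = probability of moving from s to t under a.
    r:  (S, A) array of expected rewards.
    pi: (S, A) array, pi[s, a] = probability that the policy plays a in s.
    """
    num_states = P.shape[0]
    # Transition matrix and reward vector induced by the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    # Solve the Bellman equation (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)

def exceeds_threshold(P, r, pi, gamma, rho, threshold):
    """Hypothetical helper: is the rho-averaged policy value above the threshold?"""
    v = policy_value(P, r, pi, gamma)
    return float(rho @ v) > threshold
```

In the fixed-confidence setting the transition kernel `P` is unknown, so this comparison would have to be made from sampled transitions with a controlled error probability; the sketch only illustrates the quantity whose sign relative to the threshold is being tested.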