Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on a fixed target policy and may fail when the target policy is itself estimated to be optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize when the efficient influence function exists and show that non-regularity arises when the optimal policy is non-unique. Motivated by this analysis, we propose a novel \textit{N}onparametric \textit{S}equenti\textit{A}l \textit{V}alue \textit{E}valuation (NSAVE) method, which achieves semiparametric efficiency and retains double robustness when the optimal policy is unique, and which remains stable in degenerate regimes beyond the scope of existing asymptotic theory. We further develop a smoothing-based approach for valid inference under non-unique optimal policies, and a post-selection procedure with uniform coverage for data-selected optimal policies. Simulation studies support the theoretical results. An application to the OhioT1DM mobile health dataset yields patient-specific confidence intervals for optimal policy values and for their improvement over the observed treatment policies.
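For concreteness, a minimal formulation of the estimand under standard discounted-MDP notation (the symbols $S_t$, $A_t$, $R_t$, $\gamma$, and $b$ are our notational assumptions here; the paper's own definitions may differ): with states $S_t$, actions $A_t$, rewards $R_t$, and discount factor $\gamma \in (0,1)$, the value of a policy $\pi$ and the optimal-value target are
\begin{equation*}
V(\pi) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t\Big],
\qquad
V^{*} \;=\; \sup_{\pi} V(\pi),
\end{equation*}
where the expectation is over trajectories generated by $\pi$. OPE must infer $V(\pi)$, and here $V^{*}$, from data collected under a different behavior policy $b$; non-uniqueness of the maximizer $\pi^{*}$ in the supremum is the source of the non-regularity discussed above.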