We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP), using transition observations from the original MDP collected under either the same or a different policy. This is an important problem when there may be a shift between historical and future environments due to, e.g., unmeasured confounding, distributional shift, or an adversarial environment. We propose a perturbation model that can modify transition kernel densities up to a given multiplicative factor or its reciprocal, which extends the classic marginal sensitivity model (MSM) for single-time-step decision making to infinite-horizon RL. We characterize the sharp bounds on policy value under this model, that is, the tightest possible bounds given by the transition observations from the original MDP, and we study the estimation of these bounds from such transition observations. We develop an estimator with several appealing guarantees: it is semiparametrically efficient, and remains so even when certain necessary nuisance functions, such as worst-case Q-functions, are estimated at slow nonparametric rates. It is also asymptotically normal, enabling easy statistical inference using Wald confidence intervals. In addition, when certain nuisances are estimated inconsistently, we still estimate valid, albeit possibly not sharp, bounds on the policy value. We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and accounting for having only finite samples to learn from (inference) leads to credible and reliable policy evaluation.
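To make the perturbation model concrete, the following is a minimal illustrative sketch and not the estimator described above. It assumes a single state-action pair with a finite set of next states, a known nominal transition vector `p`, and known next-state values `v`. All of these, along with the function names `worst_case_expectation` and `best_case_expectation` and the parameter `lam` (the multiplicative factor), are hypothetical choices for illustration. Under those assumptions, the worst case over kernels `q` with `p / lam <= q <= lam * p` is a small linear program that a greedy allocation solves exactly.

```python
import numpy as np


def worst_case_expectation(p, v, lam):
    """Minimize sum(q * v) over q with p/lam <= q <= lam*p and sum(q) == 1.

    p   : nominal next-state probabilities (assumed known here)
    v   : next-state values
    lam : multiplicative sensitivity factor, lam >= 1
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    if lam < 1:
        raise ValueError("lam must be at least 1")

    q = p / lam              # start every next state at its lower bound
    budget = 1.0 - q.sum()   # probability mass still to be allocated
    cap = lam * p - q        # room left before hitting the upper bound

    # Greedy step: push the remaining mass onto the lowest-value states first.
    for i in np.argsort(v):
        add = min(cap[i], budget)
        q[i] += add
        budget -= add
        if budget <= 0:
            break
    return float(q @ v), q


def best_case_expectation(p, v, lam):
    """Maximize sum(q * v) over the same set by negating the values."""
    value, q = worst_case_expectation(p, -np.asarray(v, dtype=float), lam)
    return -value, q
```

With `lam = 1` the perturbed kernel equals the nominal one and both bounds collapse to the nominal expectation; larger `lam` widens the interval between the worst and best case.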