We introduce DRUnknown, a novel doubly robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, designed for settings in which both the logging policy and the value function are unknown. The estimator first estimates the logging policy and then fits the value function model by minimizing the estimator's asymptotic variance while accounting for the effect of estimating the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal: its asymptotic variance attains the semiparametric lower bound. We report experiments in contextual bandits and reinforcement learning that compare DRUnknown with existing methods.
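To make the two-stage structure concrete, the following is a minimal sketch of a standard doubly robust estimate for a tabular contextual bandit in which the logging policy is itself estimated from the logged data. It is not the DRUnknown procedure: the reward model here is fitted by plain per-cell averaging rather than by minimizing the asymptotic variance, and all function names, the `(context, action, reward)` data layout, and the tabular models are assumptions made for illustration only.

```python
from collections import defaultdict


def estimate_logging_policy(data, actions):
    """Tabular estimate of the logging policy (hypothetical helper):
    empirical action frequencies within each observed context."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, a, _ in data:
        counts[x][a] += 1
    policy = {}
    for x, c in counts.items():
        n = sum(c.values())
        policy[x] = {a: c[a] / n for a in actions}
    return policy


def fit_reward_model(data):
    """Per-(context, action) sample mean of the reward.
    Assumption: a plain regression fit, NOT the variance-minimizing fit
    described in the abstract."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, a, r in data:
        sums[(x, a)] += r
        counts[(x, a)] += 1
    return {key: sums[key] / counts[key] for key in counts}


def dr_estimate(data, actions, target_policy):
    """Standard doubly robust value estimate with an estimated logging policy.

    data          : list of (context, action, reward) tuples
    actions       : iterable of all possible actions
    target_policy : function (context, action) -> probability
    """
    pi_b = estimate_logging_policy(data, actions)
    q = fit_reward_model(data)
    total = 0.0
    for x, a, r in data:
        # Model-based term: expected modeled reward under the target policy.
        # Unobserved (context, action) cells default to 0.0 (an assumption).
        direct = sum(target_policy(x, b) * q.get((x, b), 0.0) for b in actions)
        # Importance-weighted correction using the estimated logging policy.
        weight = target_policy(x, a) / pi_b[x][a]
        total += direct + weight * (r - q[(x, a)])
    return total / len(data)
```

In this fully tabular setup, where both models are fitted on the same data, the correction terms cancel and the estimate coincides with the model-based term. The choice of how to fit the reward model only starts to matter once the models are restricted, which is the setting the abstract addresses.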