Off-policy evaluation is critical in applications where new policies must be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging, and provide only a point estimator. In this paper, we develop a novel procedure that produces reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect, and offers valid uncertainty quantification. The main idea is to design a pseudo policy that generates subsamples as if they were sampled from the target policy, so that existing conformal prediction algorithms can be applied to construct prediction intervals. Our methods are supported by theoretical results and by experiments on synthetic data and on real data from short-video platforms.
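To make the subsampling idea concrete, the following is a minimal sketch and not the procedure developed in the paper. It assumes a toy one-step setting with a binary action, known behavior and target action probabilities, rejection sampling as the subsampling mechanism, and a linear predictor inside standard split conformal prediction. All function names (`rejection_subsample`, `split_conformal`, `simulate`) and all numerical settings are hypothetical choices made for illustration.

```python
import numpy as np


def rejection_subsample(behavior_probs, target_probs, ratio_bound, rng):
    """Return a boolean mask of logged samples to keep.

    Sample i is kept with probability pi(a_i|x_i) / (ratio_bound * b(a_i|x_i)),
    so kept actions are distributed as if drawn from the target policy.
    """
    ratio = target_probs / behavior_probs
    if np.any(ratio > ratio_bound):
        raise ValueError("ratio_bound must dominate pi(a|x) / b(a|x)")
    return rng.uniform(size=len(ratio)) < ratio / ratio_bound


def split_conformal(x_fit, y_fit, x_cal, y_cal, alpha):
    """Fit a linear predictor and calibrate a symmetric residual quantile."""
    coef = np.polyfit(x_fit, y_fit, deg=1)
    scores = np.sort(np.abs(y_cal - np.polyval(coef, x_cal)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    half_width = scores[k - 1] if k <= n else np.inf
    return coef, half_width


def simulate(n, prob_action_one, rng):
    """Toy one-step environment (assumed): return = state + action + noise."""
    x = rng.uniform(size=n)
    a = (rng.uniform(size=n) < prob_action_one).astype(int)
    y = x + a + rng.normal(scale=0.1, size=n)
    return x, a, y


rng = np.random.default_rng(0)
b1, pi1 = 0.5, 0.8  # assumed P(action = 1) under behavior and target policies

# Logged data collected under the behavior policy.
x, a, y = simulate(4000, b1, rng)
behavior_probs = np.where(a == 1, b1, 1 - b1)
target_probs = np.where(a == 1, pi1, 1 - pi1)
ratio_bound = max(pi1 / b1, (1 - pi1) / (1 - b1))

# Subsample so that the kept data look like draws from the target policy.
keep = rejection_subsample(behavior_probs, target_probs, ratio_bound, rng)
x_sub, y_sub = x[keep], y[keep]

# Split the subsample: first half fits the predictor, second half calibrates.
half = len(x_sub) // 2
coef, half_width = split_conformal(
    x_sub[:half], y_sub[:half], x_sub[half:], y_sub[half:], alpha=0.1
)

# Check empirical coverage on fresh returns generated by the target policy.
x_new, _, y_new = simulate(5000, pi1, rng)
covered = np.abs(y_new - np.polyval(coef, x_new)) <= half_width
coverage = covered.mean()
print(f"kept {keep.sum()} of {len(keep)} samples, coverage = {coverage:.3f}")
```

In this toy setting the acceptance probability does not depend on the state, so the kept pairs of state and return behave like an i.i.d. sample under the target policy, which is what split conformal prediction requires. The interval for a new state `x0` is `np.polyval(coef, x0) ± half_width`. Sequential decision problems with long horizons need more care than this one-step sketch provides.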