Offline policy evaluation (OPE) allows us to estimate the performance of a new sequential decision-making policy by leveraging historical interaction data collected from other policies. Deploying a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed over the last decade, many of which have hyperparameters and require training. Unfortunately, it remains unclear how to choose the best OPE algorithm for each task and domain. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators on a given dataset via a statistical procedure, without relying on an explicit selection step. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that, compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes toward an easy-to-use, general-purpose, estimator-agnostic off-policy evaluation framework for offline RL.
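To make the idea of blending concrete, here is a minimal sketch, assuming a simple inverse-variance weighting of per-estimator point estimates based on bootstrap replicates. The abstract does not specify the paper's actual statistical procedure, so the function name `blend_ope_estimates`, the weighting rule, and the example numbers are purely illustrative assumptions, not the proposed method.

```python
import numpy as np

def blend_ope_estimates(estimates, bootstrap_estimates):
    """Blend per-estimator OPE point estimates with inverse-variance weights.

    estimates: shape (K,) point estimates from K OPE estimators (e.g., IS, WIS, FQE).
    bootstrap_estimates: shape (K, B) bootstrap replicates per estimator,
        used here only to estimate each estimator's variance.
    """
    variances = bootstrap_estimates.var(axis=1) + 1e-8  # guard against zero variance
    weights = 1.0 / variances
    weights /= weights.sum()                            # normalize to a convex combination
    return float(np.dot(weights, estimates))

# Hypothetical usage with three estimators and synthetic bootstrap replicates.
rng = np.random.default_rng(0)
point_estimates = np.array([0.82, 0.78, 0.80])
spreads = np.array([0.10, 0.03, 0.05])[:, None]         # made-up per-estimator noise scales
boot = rng.normal(point_estimates[:, None], spreads, size=(3, 200))
print(blend_ope_estimates(point_estimates, boot))
```

The blended value leans toward lower-variance estimators while never discarding any of them, which is one plausible way to avoid an explicit winner-takes-all selection; the paper's actual procedure may differ.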