Reinforcement Learning aims to identify and evaluate efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data online, for instance when experimentation is expensive, risky, or unethical. In such applications, the reward of a given policy (the target policy) must be estimated from historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), come with no guarantees on accuracy or certainty. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift caused by the discrepancy between the target and behavior policies. We propose and empirically evaluate several ways of dealing with this shift. Some of them yield conformalized intervals that are shorter than those of existing approaches while maintaining the same certainty level.
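To make the idea of a conformal interval under distribution shift concrete, the following is a minimal sketch of generic weighted split conformal prediction, one standard way of correcting for a shift between calibration and test distributions. It is not necessarily the procedure proposed in this work. The function names, the use of absolute residuals as conformity scores, and the assumption that importance weights (for example, ratios of target to behavior policy probabilities) are already available are all illustrative choices.

```python
import numpy as np


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    scores      : conformity scores from data gathered under the behavior
                  policy (assumed here: |observed return - predicted return|).
    weights     : nonnegative importance weights of the calibration points
                  (assumed given, e.g. target/behavior likelihood ratios).
    test_weight : importance weight of the test point; its probability mass
                  is placed at +infinity, as in weighted conformal prediction.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort scores and accumulate the normalized weights in that order.
    order = np.argsort(scores)
    sorted_scores = scores[order]
    total = weights.sum() + test_weight
    cumulative = np.cumsum(weights[order]) / total

    # Smallest score whose cumulative weight reaches 1 - alpha.
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    if idx >= len(sorted_scores):
        # Calibration mass is insufficient: the interval is unbounded.
        return np.inf
    return sorted_scores[idx]


def conformal_interval(point_estimate, quantile):
    """Symmetric interval around a point estimate of the target reward."""
    return point_estimate - quantile, point_estimate + quantile


# Hypothetical usage: nine calibration residuals, uniform weights.
residuals = np.arange(1.0, 10.0)
q = weighted_conformal_quantile(residuals, np.ones(9), 1.0, alpha=0.25)
lower, upper = conformal_interval(5.0, q)
```

With uniform weights the routine reduces to ordinary split conformal prediction. Larger weights on calibration points that the target policy favors shift the quantile toward their scores, which is how the interval adapts to the policy mismatch in this sketch.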