In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited, since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While it is tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators, based on importance sampling (IS) and a novel weighting scheme, that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
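For readers unfamiliar with the baseline, the following is a minimal sketch of the standard (purely offline) IS estimator in a context-free bandit. It does not implement the proposed counterfactual-annotation weighting scheme, whose details are not given here. The function name `is_estimate`, the two-armed setup, and all numeric values are assumptions chosen for illustration only.

```python
import numpy as np

def is_estimate(actions, rewards, pi_b, pi_e):
    """Ordinary importance sampling estimate of the value of pi_e.

    actions: logged action indices drawn from the behavior policy pi_b
    rewards: observed rewards for those actions
    pi_b, pi_e: arrays of action probabilities (behavior / evaluation policy)
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    # Reweight each logged sample by how much more (or less) likely
    # the evaluation policy is to take the logged action.
    weights = pi_e[actions] / pi_b[actions]
    return float(np.mean(weights * rewards))

# Hypothetical two-armed bandit (all values are illustrative assumptions).
rng = np.random.default_rng(0)
pi_b = np.array([0.8, 0.2])        # behavior policy that generated the data
pi_e = np.array([0.1, 0.9])        # new policy to be evaluated
true_means = np.array([1.0, 2.0])  # unknown to the estimator

n = 50_000
actions = rng.choice(2, size=n, p=pi_b)
rewards = true_means[actions] + rng.normal(0.0, 0.5, size=n)

estimate = is_estimate(actions, rewards, pi_b, pi_e)
true_value = float(pi_e @ true_means)
print(f"IS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

Because the evaluation policy favors the arm that the behavior policy rarely takes, the few samples of that arm carry large weights. This is the source of the variance that the abstract says the proposed estimators may reduce.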