Reinforcement learning (RL) has been extensively studied as a way to improve human-environment interactions in human-centric tasks such as e-learning and healthcare. Because deploying and evaluating policies online is high-stakes in these tasks, off-policy evaluation (OPE) is crucial for inducing effective policies. In human-centric environments, however, OPE is challenging because the underlying state is often unobservable and only aggregate rewards can be observed (e.g., students' test scores, or whether a patient is eventually discharged from the hospital). In this work, we propose human-centric OPE (HOPE), a method that handles partial observability and aggregated rewards in such environments. Specifically, we reconstruct immediate rewards from the aggregated rewards while accounting for partial observability, and use them to estimate expected total returns. We provide a theoretical bound for the proposed method and conduct extensive experiments on real-world human-centric tasks, including sepsis treatment and an intelligent tutoring system. Our approach reliably predicts the returns of different policies and outperforms state-of-the-art benchmarks under both standard validation methods and human-centric significance tests.
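The idea of reconstructing immediate rewards from an aggregate reward and then estimating a policy's return can be illustrated with a minimal sketch. It is not the HOPE algorithm itself: it assumes a linear per-step reward model fitted by ridge-regularized least squares, treats the per-step feature vectors as fully informative (so it ignores partial observability), and uses ordinary weighted importance sampling for the return estimate. The function names `redistribute_rewards` and `weighted_is_estimate` and all parameters are hypothetical.

```python
import numpy as np


def redistribute_rewards(trajectories, aggregate_rewards, ridge=1e-6):
    """Reconstruct per-step rewards from one aggregate reward per trajectory.

    Assumption (not from the paper): the immediate reward is linear in the
    per-step features, r_t = w . phi_t, and the aggregate reward is the
    undiscounted sum of the immediate rewards.

    trajectories: list of arrays, each of shape (T_i, d).
    aggregate_rewards: array-like of shape (n,).
    Returns a list of arrays of shape (T_i,).
    """
    # Summing features over time turns sum_t w . phi_t = R_i into one
    # linear equation per trajectory.
    summed = np.stack([traj.sum(axis=0) for traj in trajectories])
    targets = np.asarray(aggregate_rewards, dtype=float)
    d = summed.shape[1]
    # Ridge-regularized normal equations keep the solve well-posed.
    w = np.linalg.solve(summed.T @ summed + ridge * np.eye(d), summed.T @ targets)
    return [traj @ w for traj in trajectories]


def weighted_is_estimate(step_rewards, ratios, gamma=1.0):
    """Weighted importance sampling estimate of a target policy's return.

    step_rewards: list of arrays (T_i,) of (reconstructed) immediate rewards.
    ratios: list of arrays (T_i,) holding pi_target / pi_behavior per step.
    """
    # Trajectory weight is the product of the per-step probability ratios.
    weights = np.array([np.prod(rho) for rho in ratios])
    returns = np.array(
        [np.sum(gamma ** np.arange(len(r)) * r) for r in step_rewards]
    )
    return float(np.sum(weights * returns) / np.sum(weights))
```

When the target and behavior policies coincide (all ratios equal to 1) and `gamma=1.0`, the estimate reduces to the mean of the reconstructed trajectory returns, which gives a quick sanity check on the redistribution step.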