Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but can also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on their time-evolution and effect on the data-collection policies. We determine when consistent value estimates are not achievable, providing and discussing algorithms to estimate lower bounds with guarantees in those cases. When consistent estimates are achievable, we provide sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on gridworld and a simulated healthcare setting of managing sepsis patients. We note that in gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.