In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we address this concern by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP from trajectories that contain only partial state observations and were generated by a different, unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how best to estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings in which the existence of so-called bridge functions makes identification possible. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study and on the problem of sepsis management.