Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce, where online experimentation is costly, dangerous, or unethical, and where the true model is unknown. However, most methods assume that all covariates used in the behavior policy's action decisions are observed. Although this assumption, sequential ignorability/unconfoundedness, is unlikely to hold in observational data, most of the factors that account for selection into treatment may still be observed, motivating sensitivity analysis. We study robust policy evaluation and policy optimization in the presence of sequentially exogenous unobserved confounders under a sensitivity model. We propose and analyze orthogonalized robust fitted-Q-iteration, which uses closed-form solutions of the robust Bellman operator to derive a loss-minimization problem for the robust Q-function and adds a bias correction to quantile estimation. Our algorithm enjoys the computational ease of fitted-Q-iteration and the statistical improvements of orthogonalization (reduced dependence on quantile-estimation error). We provide sample-complexity bounds and insights, and demonstrate effectiveness both in simulations and on real-world longitudinal healthcare data on treating sepsis. In particular, our model of sequential unobserved confounders yields an online Markov decision process, rather than a partially observed Markov decision process: we illustrate how this can enable warm-starting optimistic reinforcement learning algorithms with valid robust bounds from observational data.
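To make the abstract's algorithmic ingredients concrete, the sketch below shows one iteration of a robust fitted-Q-style backup under a bounded-likelihood-ratio ambiguity set with sensitivity parameter `LAM` (adversarial weights in [1/Λ, Λ] with mean one), for which the worst-case expectation has the closed form (1/Λ)·E[V] + (1 − 1/Λ)·CVaR at level τ = 1/(1+Λ), estimated with a Rockafellar-Uryasev-style term that is first-order insensitive to quantile-estimation error. This is a minimal illustration under those assumptions, not the paper's exact estimator; the names (`robust_fqi_iteration`, `GAMMA`, `LAM`) and the use of scikit-learn regressors are illustrative choices.

```python
# Minimal sketch of one orthogonalized robust fitted-Q backup (illustrative,
# not the paper's exact estimator). Assumes adversarial weights in [1/LAM, LAM]
# with mean one, giving a closed-form worst case via a lower-tail CVaR.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

GAMMA, LAM = 0.9, 2.0            # discount and sensitivity parameter (assumed values)
TAU = 1.0 / (1.0 + LAM)          # quantile level implied by the closed form

def robust_fqi_iteration(s, a, r, v_next):
    """One backup: regress orthogonalized robust targets on (s, a).

    v_next: current robust value estimate at each sampled next state,
            e.g. max_a' Q_k(s', a') from the previous iteration.
    """
    x = np.column_stack([s, a])

    # Step 1: quantile regression of V(s') on (s, a) at level TAU.
    q_model = GradientBoostingRegressor(loss="quantile", alpha=TAU)
    q_model.fit(x, v_next)
    q_hat = q_model.predict(x)

    # Step 2: orthogonalized worst-case value. The Rockafellar-Uryasev form
    # q_hat - (q_hat - V)_+ / TAU is first-order insensitive to errors in
    # q_hat, i.e. the bias correction relative to a plug-in quantile.
    cvar_term = q_hat - np.maximum(q_hat - v_next, 0.0) / TAU
    worst_case = (1.0 / LAM) * v_next + (1.0 - 1.0 / LAM) * cvar_term

    # Step 3: standard fitted-Q regression onto the robust targets.
    y = r + GAMMA * worst_case
    q_next = GradientBoostingRegressor()
    q_next.fit(x, y)
    return q_next
```

Iterating this backup keeps the computational profile of ordinary fitted-Q-iteration (two regressions per step), while the CVaR-style correction is what weakens the dependence on quantile-estimation error that a naive plug-in of the estimated quantile would incur.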