Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce, where online experimentation is costly, dangerous, or unethical, and where the true model is unknown. However, most methods assume that all covariates used in the behavior policy's action decisions are observed. This assumption, known as sequential ignorability or unconfoundedness, is unlikely to hold in observational data; still, most of the information that accounts for selection into treatment may be observed, which motivates sensitivity analysis. We study robust policy evaluation and policy optimization in the presence of sequentially-exogenous unobserved confounders under a sensitivity model. We propose and analyze orthogonalized robust fitted-Q-iteration, which uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function and adds a bias correction to quantile estimation. Our algorithm enjoys the computational ease of fitted-Q-iteration and the statistical improvements (reduced dependence on quantile estimation error) that come from orthogonalization. We provide sample complexity bounds and insights, and demonstrate effectiveness both in simulations and on real-world longitudinal healthcare data on sepsis treatment. In particular, our model of sequential unobserved confounders yields an online Markov decision process rather than a partially observed Markov decision process: we illustrate how this can enable warm-starting optimistic reinforcement learning algorithms with valid robust bounds from observational data.
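To make the role of the closed-form robust Bellman step and the quantile more concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes, hypothetically, that the sensitivity model restricts importance weights to an interval `[a, b]` with `a < 1 < b` and average weight one, and that `y` holds next-step regression targets (for example, reward plus discounted value). Under those assumptions the worst-case weighted mean has a well-known closed form in terms of a lower-tail conditional value-at-risk, written here in a form whose value is maximized at the true quantile. The function names and the bounds `a` and `b` are illustrative and do not come from the source.

```python
import numpy as np


def worst_case_mean_closed_form(y, a, b, q=None):
    """Lower bound on mean(w * y) over weights w in [a, b] with mean(w) == 1.

    Assumed setup (not from the source): a < 1 < b are hypothetical weight
    bounds implied by a sensitivity model; y are next-step targets.
    """
    if not (a < 1.0 < b):
        raise ValueError("expected a < 1 < b")
    y = np.asarray(y, dtype=float)
    tau = (1.0 - a) / (b - a)  # fraction of mass that receives weight b
    if q is None:
        q = np.quantile(y, tau)  # plug-in estimate of the tau-quantile
    # Lower-tail CVaR written as q - E[(q - Y)_+] / tau. This expression is
    # concave in q and maximized at the tau-quantile, so a misestimated q
    # can only lower the bound.
    cvar_lower = q - np.mean(np.maximum(q - y, 0.0)) / tau
    return float(a * np.mean(y) + (1.0 - a) * cvar_lower)


def worst_case_mean_brute_force(y, a, b):
    """Same bound by explicit weights; assumes n * tau is an integer."""
    y = np.sort(np.asarray(y, dtype=float))
    n = y.size
    k = int(round(n * (1.0 - a) / (b - a)))
    w = np.full(n, a)
    w[:k] = b  # adversary puts the largest weight on the smallest targets
    return float(np.mean(w * y))


# Illustrative usage on synthetic targets (values are random, not real data).
rng = np.random.default_rng(0)
targets = rng.normal(size=300)
bound = worst_case_mean_closed_form(targets, a=0.5, b=2.0)
check = worst_case_mean_brute_force(targets, a=0.5, b=2.0)
```

In this sketch, an error in the estimated quantile `q` can only make the lower bound more conservative, which gives some intuition for why a correction that reduces dependence on quantile estimation error is attractive. A fitted-Q-style method would regress such robust targets on state-action features instead of taking a plain sample mean.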