Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce, where online experimentation is costly, dangerous, or unethical, and where the true model is unknown. However, most methods assume that all covariates used in the behavior policy's action decisions are observed. This assumption is untestable and may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume that the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration algorithm that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds and insights, and demonstrate the method's effectiveness in simulations.