This paper explores policy learning from observational data, focusing on a nonlinear welfare criterion in a binary treatment setting. The nonlinear criterion is inspired by scenarios where policymakers prioritize specific population segments. We model this criterion using a utility function that encompasses potential outcomes and intermediate parameters, with the latter capturing higher moments of the outcome distributions. When formulated in the context of observational data, both the intermediate parameters and the welfare criterion depend on the propensity score, which we estimate using machine-learning techniques. To address bias in machine learning estimates, we introduce a novel reweighting-based debiasing approach that offers a promising alternative to traditional orthogonality-based methods. To tackle the complexities of infinite-dimensional policy spaces, we employ sieve approximations and $K$-fold cross-validation for model selection, thereby fully automating the policy-learning process. Despite these complexities, we demonstrate that both the welfare regret and the average welfare regret of our proposed policy learning method satisfy an oracle inequality, thereby providing theoretical guarantees on the performance of the estimated policy relative to the best possible policy. This finding extends the existing results from linear to nonlinear welfare criteria, from finite-dimensional to infinite-dimensional policy spaces, and from a known propensity score to a machine-learned one.
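To make the idea of a nonlinear welfare criterion concrete, here is a minimal sketch, not the paper's estimator. It assumes a mean-variance utility (the mean of the outcome under a policy minus a penalty `lam` times its variance) as one illustrative case where the welfare depends on a higher moment. It also takes the propensity scores `e` as given, whereas the abstract describes estimating them by machine learning and then debiasing. The function names `ipw_moment` and `mean_variance_welfare`, the penalty `lam`, and the toy data are all hypothetical.

```python
from typing import Callable, Sequence


def ipw_moment(
    y: Sequence[float],
    d: Sequence[int],
    x: Sequence[float],
    e: Sequence[float],
    policy: Callable[[float], int],
    power: int,
) -> float:
    """Normalized inverse-propensity-weighted estimate of E[Y(policy)^power].

    Only units whose observed treatment d_i agrees with policy(x_i) contribute,
    each weighted by the inverse probability of the treatment it received.
    """
    num = 0.0
    den = 0.0
    for yi, di, xi, ei in zip(y, d, x, e):
        a = policy(xi)
        if di != a:
            continue  # observed treatment disagrees with the policy
        w = 1.0 / (ei if a == 1 else 1.0 - ei)
        num += w * yi ** power
        den += w
    if den == 0.0:
        raise ValueError("no observation matches the policy's assignments")
    return num / den


def mean_variance_welfare(
    y: Sequence[float],
    d: Sequence[int],
    x: Sequence[float],
    e: Sequence[float],
    policy: Callable[[float], int],
    lam: float = 0.5,
) -> float:
    """Illustrative nonlinear welfare: mean minus lam times variance.

    The first and second moments play the role of intermediate parameters;
    the welfare is a nonlinear function of them (assumed form, for illustration).
    """
    m1 = ipw_moment(y, d, x, e, policy, 1)
    m2 = ipw_moment(y, d, x, e, policy, 2)
    return m1 - lam * (m2 - m1 ** 2)


# Toy data (hypothetical): outcomes, treatments, one covariate, known propensities.
y = [1.0, 3.0, 2.0, 4.0]
d = [1, 1, 0, 0]
x = [0.2, 0.8, 0.3, 0.9]
e = [0.5, 0.5, 0.5, 0.5]

treat_all = lambda xi: 1
treat_none = lambda xi: 0
threshold = lambda xi: 1 if xi > 0.5 else 0

welfare_all = mean_variance_welfare(y, d, x, e, treat_all)
welfare_none = mean_variance_welfare(y, d, x, e, treat_none)
welfare_threshold = mean_variance_welfare(y, d, x, e, threshold)
```

Comparing the three welfare values across candidate policies is a crude stand-in for policy learning over a small finite class. The setting in the abstract replaces this with sieve approximations of an infinite-dimensional policy space and uses $K$-fold cross-validation to choose among them.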