Treatment effect heterogeneity is ubiquitous in many fields, motivating practitioners to search for the optimal policy that maximizes the expected outcome based on individual characteristics. However, most existing policy learning methods rely on weighting-based approaches, which can be highly unstable in observational studies. To enhance the robustness of the estimated policy, we propose a matching-based estimator of the policy improvement over a randomized baseline. After correcting the conditional bias, we learn the optimal policy by maximizing this estimate over a policy class. We derive a non-asymptotic high-probability bound on the regret of the learned policy and show that its convergence rate is nearly $1/\sqrt{n}$. The competitive finite-sample performance of the proposed method is demonstrated in extensive simulation studies and a real-data application.
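To fix ideas, here is a minimal sketch of the kind of quantity such a matching-based approach targets; the notation is ours for illustration and the conditional-bias correction is omitted. Given covariates $X_i$, binary treatment $A_i \in \{0,1\}$, and outcome $Y_i$, let $\widehat{Y}_i(a)$ denote the observed $Y_i$ when $A_i = a$ and the outcome of a matched unit with treatment $a$ otherwise. The improvement of a policy $\pi$ over the uniformly randomized baseline can then be estimated by
\[
\widehat{\Delta}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\widehat{Y}_i\big(\pi(X_i)\big) \;-\; \frac{\widehat{Y}_i(0)+\widehat{Y}_i(1)}{2}\right],
\]
and the policy is learned as $\widehat{\pi} = \arg\max_{\pi \in \Pi} \widehat{\Delta}(\pi)$ over a prespecified class $\Pi$.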