Minimax optimal subgroup identification

Quantifying treatment effect heterogeneity is a crucial task in many areas of causal inference, e.g., optimal treatment allocation and estimation of subgroup effects. We study the problem of estimating the level sets of the conditional average treatment effect (CATE), identified under the no-unmeasured-confounders assumption. Given a user-specified threshold, the goal is to estimate the set of all units for whom the treatment effect exceeds that threshold. For example, if the cutoff is zero, the estimand is the set of all units who would benefit from receiving treatment. Assigning treatment only to this set is the optimal treatment rule, in the sense that it maximises the mean population outcome. Similarly, cutoffs greater than zero represent optimal rules under resource constraints. The level set estimator that we study follows the plug-in principle and consists of simply thresholding a good estimator of the CATE. While many CATE estimators have recently been proposed and analysed, how their properties relate to those of the corresponding level set estimators remains unclear. Our first goal is thus to fill this gap by deriving the asymptotic properties of level set estimators as a function of the CATE estimator used. Next, we identify a minimax optimal estimator in a model where the CATE, the propensity score and the outcome model are Hölder-smooth of varying orders. We consider data generating processes that satisfy a margin condition governing the probability of observing units for whom the CATE is close to the threshold. We investigate the performance of the estimators in simulations and illustrate our methods on a dataset used to study the effects on mortality of laparoscopic vs open surgery in the treatment of various conditions of the colon.
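
To make the estimand concrete, the display below writes out the level set and its plug-in estimator. The symbols used (covariates X, binary treatment A, outcome Y, CATE tau, threshold theta, level set Gamma, constants C and xi) are notation assumed here for illustration and are not fixed by the abstract. The margin condition is shown in one common form, and the exact version studied may differ.

```latex
% Notation assumed for illustration: covariates X, binary treatment A, outcome Y.
\begin{align*}
\tau(x) &= \mathbb{E}[Y \mid X = x, A = 1] - \mathbb{E}[Y \mid X = x, A = 0]
  && \text{(CATE, as identified under no unmeasured confounding)} \\
\Gamma(\theta) &= \{x : \tau(x) > \theta\}
  && \text{(level set at threshold } \theta\text{)} \\
\widehat{\Gamma}(\theta) &= \{x : \widehat{\tau}(x) > \theta\}
  && \text{(plug-in estimator: threshold an estimate } \widehat{\tau}\text{)} \\
\mathbb{P}\bigl(|\tau(X) - \theta| \le t\bigr) &\le C\, t^{\xi}
  \quad \text{for all sufficiently small } t > 0
  && \text{(one common form of margin condition)}
\end{align*}
```

With theta = 0, the set Gamma(0) collects the units who benefit from treatment. A larger exponent xi in the assumed margin condition means fewer units have a CATE near the threshold, which is the regime in which thresholding an estimate of tau is least likely to misclassify units.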
