Causal effects are often characterized with population summaries. Such summaries can give an incomplete picture when treatment effects are heterogeneous across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: \emph{Causal k-Means Clustering}, which harnesses the widely used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator that is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite-sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
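To make the plug-in idea concrete, the following is a minimal sketch, not the estimator as specified in the paper. It assumes a toy setting invented for illustration: a single covariate, a randomized treatment with three levels, and two latent subgroups whose mean outcomes (the `effects` table) are hypothetical. Each counterfactual regression function is estimated with a simple nearest-neighbor average, and the resulting vectors of estimated counterfactual means are clustered with a plain k-means routine. All function names are illustrative, and the sketch performs no bias correction.

```python
import random
from statistics import fmean


def knn_mean(x_train, y_train, x0, n_neighbors=15):
    """Estimate E[Y | X = x0] by averaging outcomes of the nearest training points."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    return fmean(y_train[i] for i in order[:n_neighbors])


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(fmean(coord) for coord in zip(*members))
    return centers, labels


def simulate(n=600, seed=0):
    """Toy data (an assumption): covariate X, randomized arm A in {0, 1, 2}, outcome Y."""
    rng = random.Random(seed)
    effects = {0: (0.0, 1.0, 2.0), 1: (0.0, -1.0, 3.0)}  # hypothetical subgroup means
    xs, arms, ys, groups = [], [], [], []
    for _ in range(n):
        x = rng.random()
        g = int(x >= 0.5)  # latent subgroup, never shown to the estimator
        a = rng.randrange(3)
        xs.append(x)
        arms.append(a)
        groups.append(g)
        ys.append(effects[g][a] + rng.gauss(0.0, 0.5))
    return xs, arms, ys, groups


def plug_in_cluster(xs, arms, ys, k=2, n_arms=3):
    """Regress Y on X within each arm, then cluster the fitted mean vectors."""
    by_arm = {
        a: ([x for x, t in zip(xs, arms) if t == a],
            [y for y, t in zip(ys, arms) if t == a])
        for a in range(n_arms)
    }
    # One vector of estimated counterfactual means per unit.
    mu_hat = [tuple(knn_mean(*by_arm[a], x) for a in range(n_arms)) for x in xs]
    return kmeans(mu_hat, k)


xs, arms, ys, groups = simulate()
centers, labels = plug_in_cluster(xs, arms, ys)

# Cluster labels are arbitrary, so score agreement up to label switching.
agreement = fmean(int(lab == g) for lab, g in zip(labels, groups))
agreement = max(agreement, 1 - agreement)
print(f"agreement with true subgroups: {agreement:.2f}")
```

The clustering step never sees the outcomes directly, only the fitted regression values, which is what separates this setup from conventional k-means on observed variables. In an observational study the per-arm regressions would need to adjust for all confounders, and a more flexible learner would normally replace the nearest-neighbor average.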