Causal effects are often characterized with population summaries. Such summaries can give an incomplete picture when treatment effects are heterogeneous across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: \emph{Causal k-Means Clustering}, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite-sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
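To make the plug-in idea concrete, the following is a minimal sketch, not the estimator as specified in the paper. It assumes the counterfactual functions are the conditional outcome means under each treatment level, that these are identified by outcome regression under no unmeasured confounding, and that scikit-learn's `RandomForestRegressor` and `KMeans` serve as the off-the-shelf components. The function name `plug_in_causal_kmeans` and the simulated data are hypothetical. The sketch omits sample splitting and the bias correction discussed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def plug_in_causal_kmeans(X, A, Y, n_clusters, seed=0):
    """Hypothetical plug-in sketch: cluster units on estimated
    counterfactual mean vectors (mu_a(X) for each treatment level a)."""
    levels = np.unique(A)
    mu_hat = np.empty((X.shape[0], levels.size))
    for j, a in enumerate(levels):
        mask = A == a
        # Assumption: E[Y | X, A = a] identifies the counterfactual mean
        # under treatment a (no unmeasured confounding).
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[mask], Y[mask])
        # Predict for every unit, not only those who received level a.
        mu_hat[:, j] = model.predict(X)
    # Ordinary k-means, applied to the estimated functions rather than
    # to the raw covariates.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(mu_hat)
    return km.labels_, km.cluster_centers_, mu_hat


# Simulated example (assumed data-generating process, for illustration only):
# two subgroups defined by the sign of the first covariate, with opposite
# treatment effects of +2 and -2.
rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-1.0, 1.0, size=(n, 2))
A = rng.integers(0, 2, size=n)            # randomized binary treatment
tau = np.where(X[:, 0] > 0, 2.0, -2.0)    # true subgroup effects
Y = np.where(A == 1, tau, 0.0) + rng.normal(scale=0.5, size=n)

labels, centers, mu_hat = plug_in_causal_kmeans(X, A, Y, n_clusters=2)

# Agreement with the true subgroups, up to relabeling of the clusters.
truth = (X[:, 0] > 0).astype(int)
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
```

In this sketch each row of `centers` is an estimated vector of counterfactual means for one cluster, so differences between its columns can be read as cluster-level treatment contrasts. Any regression method could replace the random forest; the choice here is only for convenience.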