Estimation of conditional average treatment effects (CATEs) is an important topic in the sciences. CATEs can be estimated with high accuracy if data distributed across multiple parties can be centralized. However, such data are difficult to aggregate owing to privacy concerns. To address this issue, we propose data collaboration double machine learning, a method that estimates CATE models while preserving the privacy of distributed data, and evaluate it through simulations. Our contributions are threefold. First, our method enables estimation and testing of semi-parametric CATE models on distributed data without iterative communication. Semi-parametric CATE models allow estimation and testing that are more robust to model misspecification than parametric models. Second, our method enables collaborative estimation across multiple time points and different parties. Third, our method performed as well as or better than other methods in simulations using synthetic, semi-synthetic, and real-world datasets.