In recent years, privacy-preserving causal inference on distributed data has attracted considerable attention. Many existing methods for distributed data address only a shortage of subjects (samples) and can therefore reduce only the random error in treatment effect estimates. In this study, we propose a data collaboration quasi-experiment (DC-QE) that addresses a shortage of both subjects and covariates, reducing both random error and bias in the estimation. Our method proceeds in four steps: constructing dimensionality-reduced intermediate representations from the private data held by local parties, sharing these intermediate representations instead of the private data to preserve privacy, estimating propensity scores from the shared intermediate representations, and finally estimating treatment effects from the propensity scores. Numerical experiments on both artificial and real-world data confirm that our method yields better estimates than individual analyses. Dimensionality reduction discards some of the information in the private data and degrades performance, but we observe that sharing intermediate representations among many parties, which resolves the shortage of subjects and covariates, improves performance enough to more than offset this degradation. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. If our method becomes widely used, intermediate representations could be published as open data, helping researchers identify causal relationships and accumulate a knowledge base.
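The four steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not state: data are split by subjects only (the covariate-sharing case is omitted), each party uses PCA as its private reduction, a shareable random `anchor` dataset is used to align the parties' representations by least squares, treatments and outcomes are shared alongside the representations, and the treatment effect is obtained by inverse probability weighting. All function names and the synthetic data (true effect 2.0) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALE = np.array([2.0, 2.0, 1.0, 1.0, 0.5, 0.5])  # synthetic covariate scales

def simulate(n, rng):
    """Synthetic party data: covariates X, treatment z, outcome y (true effect 2.0)."""
    X = rng.normal(size=(n, 6)) * SCALE
    logit = 0.4 * X[:, 0] + 0.25 * X[:, 1]           # X0, X1 are confounders
    z = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    y = 2.0 * z + 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    return X, z, y

def local_map(X, dim):
    """Party-private linear reduction (PCA via SVD); assumed choice of mapping."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:dim].T

def fit_propensity(X, z, steps=500, lr=0.5):
    """Logistic regression by gradient descent; returns fitted propensity scores."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - z) / len(z)
    return 1 / (1 + np.exp(-Xb @ w))

def ipw_ate(z, y, e):
    """Normalized inverse-probability-weighted average treatment effect."""
    e = np.clip(e, 0.01, 0.99)
    w1, w0 = z / e, (1 - z) / (1 - e)
    return (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()

# Step 1: each party reduces its own data and the anchor with its private map.
parties = [simulate(500, rng) for _ in range(3)]
anchor = rng.normal(size=(200, 6))                   # shareable dummy data (assumption)
shared = []
for X, z, y in parties:
    mu, F = local_map(X, dim=4)
    shared.append(((X - mu) @ F, (anchor - mu) @ F, z, y))

# Step 2: only intermediate representations leave the parties; the analyst
# aligns them to the first party's anchor image by least squares.
target = shared[0][1]
reps = []
for X_tilde, A_tilde, _, _ in shared:
    G, *_ = np.linalg.lstsq(A_tilde, target, rcond=None)
    reps.append(X_tilde @ G)
X_hat = np.vstack(reps)
z_all = np.concatenate([s[2] for s in shared])
y_all = np.concatenate([s[3] for s in shared])

# Steps 3 and 4: propensity scores, then the treatment effect.
e = fit_propensity(X_hat, z_all)
ate = ipw_ate(z_all, y_all, e)
naive = y_all[z_all == 1].mean() - y_all[z_all == 0].mean()
print(f"naive difference in means: {naive:.2f}, IPW estimate: {ate:.2f}")
```

In this synthetic setup the unadjusted difference in means is confounded by the first two covariates, while the weighted estimate pools all parties' subjects without any party revealing its raw covariates. The reduced dimension (4 of 6) is an arbitrary choice for illustration.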