In recent years, privacy-preserving causal inference on distributed data has attracted considerable attention. Many existing methods for distributed data focus on resolving the lack of subjects (samples) and can reduce only random errors in estimating treatment effects. In this study, we propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing both random errors and biases in the estimation. Our method consists of four steps: constructing dimensionality-reduced intermediate representations from the private data held by local parties; sharing these intermediate representations instead of the private data to preserve privacy; estimating propensity scores from the shared intermediate representations; and, finally, estimating the treatment effects from the propensity scores. Through numerical experiments on both artificial and real-world data, we confirm that our method yields better estimation results than individual analyses. Although dimensionality reduction discards some information in the private data and degrades performance, we observe that sharing intermediate representations among many parties to resolve the lack of subjects and covariates improves performance enough to outweigh this degradation. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. If our method comes into widespread use, intermediate representations could be published as open data to help researchers find causalities and accumulate a knowledge base.
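The four steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and nearly every detail is an assumption made for illustration: the synthetic data, the random linear maps used for dimensionality reduction, the shared anchor data with least-squares alignment (a device borrowed from data collaboration analysis that the abstract does not describe), the sharing of treatments and outcomes alongside the representations, the gradient-descent logistic regression, and the normalized inverse probability weighting estimator. Only the lack of subjects is simulated here (two parties holding disjoint subjects); the lack of covariates is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic data: 4 covariates, true treatment effect = 2.0.
n, d, d_red = 2000, 4, 3
X = rng.normal(size=(n, d))
t = rng.binomial(1, sigmoid(X @ np.array([0.8, -0.5, 0.3, 0.0])))
y = 2.0 * t + X @ np.array([1.0, 1.0, 0.5, 0.0]) + rng.normal(size=n)

# Two parties hold disjoint sets of subjects (assumed horizontal split).
parts = np.array_split(np.arange(n), 2)

# Assumed non-private anchor rows that every party can see.
anchor = rng.normal(size=(200, d))

# Steps 1-2: each party reduces its data with a private linear map F and
# shares only the reduced representations (plus t and y, by assumption).
shared = []
for idx in parts:
    F = rng.normal(size=(d, d_red))  # never leaves the party
    shared.append({"Z": X[idx] @ F, "A": anchor @ F, "t": t[idx], "y": y[idx]})

# Analyst: align every party's representation to party 0's by least squares
# on the anchor representations (the alignment is approximate, not exact).
target = shared[0]["A"]
aligned = []
for s in shared:
    G, *_ = np.linalg.lstsq(s["A"], target, rcond=None)
    aligned.append(s["Z"] @ G)
Z = np.vstack(aligned)
t_all = np.concatenate([s["t"] for s in shared])
y_all = np.concatenate([s["y"] for s in shared])

# Step 3: propensity scores from the shared representations
# (logistic regression fitted by plain gradient descent).
def fit_logistic(Z, t, iters=2000, lr=0.5):
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)    # standardize for stability
    Zb = np.column_stack([np.ones(len(Zs)), Zs])  # add intercept column
    w = np.zeros(Zb.shape[1])
    for _ in range(iters):
        w -= lr * Zb.T @ (sigmoid(Zb @ w) - t) / len(t)
    return sigmoid(Zb @ w)

e = np.clip(fit_logistic(Z, t_all), 0.01, 0.99)

# Step 4: treatment effect from the propensity scores (normalized IPW).
w1, w0 = t_all / e, (1 - t_all) / (1 - e)
ate_ipw = np.sum(w1 * y_all) / np.sum(w1) - np.sum(w0 * y_all) / np.sum(w0)
ate_naive = y_all[t_all == 1].mean() - y_all[t_all == 0].mean()

print("naive difference in means:", ate_naive)
print("IPW estimate on shared representations:", ate_ipw)
```

Because the reduction from four to three dimensions discards information and the alignment is only approximate, the weighted estimate in this sketch need not recover the simulated effect exactly; it is meant to show the flow of data between parties and analyst, not to reproduce the reported results.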