Patient privacy is a major barrier to healthcare AI. For confidentiality reasons, most patient data remains siloed in separate hospitals, which prevents the design of data-driven healthcare AI systems that need large volumes of patient data to make effective decisions. One solution is collective learning across multiple sites through federated learning with differential privacy. However, the literature in this space typically focuses on differentially private statistical estimation and machine learning, which differ from the causal inference problems that arise in healthcare. In this work, we take a fresh look at federated learning with a focus on causal inference. Specifically, we consider estimating the average treatment effect (ATE), an important task in causal inference for healthcare applications, and provide a federated analytics approach that enables ATE estimation across multiple sites with differential privacy (DP) guarantees at each site. The main challenge is site heterogeneity: different sites have different sample sizes and privacy budgets. We address this through a class of per-site estimation algorithms that report the ATE estimate together with its variance as a quality measure, and a server-side aggregation algorithm that minimizes the overall variance of the final ATE estimate. Our experiments on real and synthetic data show that our method reliably aggregates private statistics across sites and provides a better privacy-utility tradeoff under site heterogeneity than baselines.
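To make the server-side aggregation step concrete, the following is a minimal sketch under stated assumptions, not the authors' algorithm. It assumes each site reports an unbiased ATE estimate and a variance that already accounts for both sampling error and DP noise, and that site estimates are independent. Under those assumptions, the classical minimum-variance linear combination weights each site in proportion to the inverse of its reported variance. The function name `aggregate_ate` and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def aggregate_ate(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Inverse-variance weighted combination of per-site ATE estimates.

    Assumption (not from the source): site estimates are independent and
    unbiased, and each reported variance includes sampling and DP noise.
    Returns the aggregated estimate and its variance.
    """
    if len(estimates) == 0 or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Precision = 1 / variance; noisier sites receive smaller weights.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    ate = sum(p * e for p, e in zip(precisions, estimates)) / total_precision
    # Variance of the weighted average under independence.
    return ate, 1.0 / total_precision


# Hypothetical example: a large, low-noise site and a small, high-noise site.
ate, var = aggregate_ate([2.0, 4.0], [1.0, 3.0])
print(ate, var)
```

The aggregated variance is never larger than the smallest site variance, so a site with a small sample or a tight privacy budget can still contribute without degrading the estimate from a better-resourced site.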