Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible. We propose a framework in which each party shares a differentially private synthetic twin of its data. We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank. We find that parties engaging in collaborative learning via shared synthetic data obtain more accurate estimates of target statistics than they do using only their local data. This finding extends to the difficult case of small, heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become. Finally, we find that data sharing especially helps parties whose data contain underrepresented groups to perform better-adjusted analyses for those groups. Based on our results, we conclude that sharing synthetic twins is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Distributed sensitive data is often a bottleneck in biomedical research, and our study shows that this bottleneck can be alleviated with privacy-preserving collaborative learning methods.