The age of big data has fueled expectations for accelerating learning. Large data sets enable more powerful statistical analyses and more reliable conclusions, since these can be based on a broad collection of subjects. Often such data sets can be assembled only by drawing on diverse sources, as in medical research that combines data from multiple centers in a federated analysis. However, these hopes must be balanced against data privacy concerns, which hinder the sharing of raw data among centers. Consequently, federated analyses typically resort to sharing data summaries from each center. Restricting the analysis to summaries carries the risk of impairing the efficiency of statistical procedures. In this work we take a close look at the effects of federated analysis on two very basic problems: nonparametric comparison of two groups, and quantile estimation to describe the corresponding distributions. We also propose a specific privacy-preserving data release policy for federated analysis under the $K$-anonymity criterion, which has been adopted by the Medical Informatics Platform of the European Human Brain Project. Our results show that, for these tasks, federated analysis incurs only a modest loss of statistical efficiency.
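To make the setting concrete, the following is a minimal sketch of quantile estimation from summaries alone. It is not the release policy proposed in this work, which the abstract does not specify. It assumes, purely for illustration, that every center bins its data on shared bin edges and suppresses any bin count below $K$ by reporting it as zero. The function names `release_counts` and `pooled_quantile` are hypothetical.

```python
from bisect import bisect_right


def release_counts(values, edges, k):
    """Bin one center's values on shared edges and release the counts.

    Assumed illustrative policy: any bin holding fewer than k subjects
    is reported as 0, so no released cell describes 1..k-1 individuals.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside [edges[0], edges[-1]) are ignored in this sketch.
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    return [c if c >= k else 0 for c in counts]


def pooled_quantile(released, edges, p):
    """Estimate the p-quantile from the released counts of all centers.

    Counts are summed per bin across centers, and the quantile is
    interpolated linearly within the bin where the cumulative count
    first reaches p * n.
    """
    pooled = [sum(column) for column in zip(*released)]
    n = sum(pooled)
    if n == 0:
        raise ValueError("no counts were released")
    target = p * n
    cumulative = 0
    for i, c in enumerate(pooled):
        if c > 0 and cumulative + c >= target:
            fraction = (target - cumulative) / c
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        cumulative += c
    return edges[-1]


# Two hypothetical centers sharing the bin edges 0, 1, 2, 3 with K = 3.
edges = [0.0, 1.0, 2.0, 3.0]
center_a = release_counts([0.5] * 5 + [1.5] * 5, edges, k=3)
center_b = release_counts([1.5] * 4 + [2.5] * 2, edges, k=3)
median = pooled_quantile([center_a, center_b], edges, p=0.5)
```

In this toy example the second center has only two subjects in the last bin, so that cell is suppressed and those subjects do not contribute to the pooled estimate. Suppression and binning of this kind are one way in which a restriction to summaries can cost statistical efficiency.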