The analysis of data stored across multiple sites has become increasingly common, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to avoiding heavy data transfer, securing valuable data, and protecting personal information. How to aggregate the information obtained from analyzing data at separate local sites has therefore become an important statistical issue. Commonly used averaging methods may be unsuitable because data are nonhomogeneous and results are not comparable across sites, and applying them may lose information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate this integration and accelerate the analysis. We develop a data-driven method that efficiently and effectively aggregates valuable information from local analyses while avoiding the information-security risks and heavy data transfer that communication between sites entails. In addition, when applied to generalized linear models, the proposed method preserves the properties of classical sequential adaptive design, such as a data-driven sample size and a prescribed estimation precision. We illustrate the proposed method with numerical studies on simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico.
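To make the aggregation concern concrete, the following minimal sketch contrasts a plain average of site-level coefficient estimates with an information-weighted combination, the standard fixed-effect pooling rule. It is not the sequential method proposed in this work. The function names, the two hypothetical sites, and their estimates and information matrices are all assumptions chosen for illustration.

```python
import numpy as np


def simple_average(estimates):
    """Unweighted mean of the site-level coefficient vectors."""
    return np.mean(estimates, axis=0)


def information_weighted(estimates, informations):
    """Pool site estimates as (sum_k I_k)^{-1} sum_k I_k beta_k.

    Each I_k is a site's (estimated) Fisher information matrix, so sites
    that estimate a coefficient more precisely receive more weight.
    """
    total_info = np.sum(informations, axis=0)
    weighted_sum = np.sum(
        [info @ beta for info, beta in zip(informations, estimates)], axis=0
    )
    return np.linalg.solve(total_info, weighted_sum)


# Hypothetical summaries from two sites (illustrative numbers only).
estimates = np.array([[1.0, 2.0],
                      [3.0, 0.0]])
informations = np.array([np.diag([9.0, 1.0]),   # site 1: precise on coefficient 0
                         np.diag([1.0, 1.0])])  # site 2: less precise

avg = simple_average(estimates)
pooled = information_weighted(estimates, informations)
print("simple average:      ", avg)
print("information-weighted:", pooled)
```

Only the summary quantities (estimates and information matrices) leave each site in this sketch, not the raw records. The plain average treats both sites equally, whereas the weighted rule pulls the first coefficient toward the site that estimates it more precisely, which is the kind of information a naive average can discard.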