Upholding data privacy, especially in medical research, often means that individual-level patient data are difficult to access. Estimating mixed effects binary logistic regression models from data held by multiple data providers, such as hospitals, thus becomes more challenging. Federated learning has emerged as an option that preserves the privacy of individual observations while still estimating a global model that can be interpreted at the individual level, but it usually involves iterative communication between the data providers and the data analyst. In this paper, we present a strategy for estimating a mixed effects binary logistic regression model that requires data providers to share summary statistics only once. It involves generating pseudo-data whose summary statistics match those of the actual data and using these pseudo-data in the model estimation process in place of the unavailable actual data. Our strategy accommodates multiple predictors, which can be a combination of continuous and categorical variables. Through simulation, we show that our approach estimates the true model at least as well as the approach that requires the pooled individual observations. An illustrative example using real data is provided. Unlike typical federated learning algorithms, our approach eliminates infrastructure requirements and security issues while remaining communication-efficient and accounting for heterogeneity.
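To make the pseudo-data idea concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that a data provider shares only the sample mean, the sample covariance matrix, and the sample size of its continuous predictors, and it generates pseudo-observations whose sample mean and covariance match those values exactly. The function name `pseudo_data` and the example numbers are hypothetical; the abstract does not specify which summary statistics are shared, nor how categorical predictors or the binary response are handled, so none of that is covered here.

```python
import numpy as np

def pseudo_data(mean, cov, n, seed=0):
    """Generate n rows whose sample mean and sample covariance (ddof=1)
    equal `mean` and `cov` exactly. Illustrative assumption: continuous
    predictors only, n larger than the number of predictors, and `cov`
    positive definite."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.atleast_2d(np.asarray(cov, dtype=float))

    # Start from arbitrary noise and centre it so the sample mean is exactly 0.
    z = rng.standard_normal((n, mean.shape[0]))
    z -= z.mean(axis=0)

    # Whiten: after this step the sample covariance is exactly the identity.
    chol_z = np.linalg.cholesky(np.atleast_2d(np.cov(z, rowvar=False)))
    w = np.linalg.solve(chol_z, z.T).T

    # Colour with the target covariance and shift to the target mean.
    chol_target = np.linalg.cholesky(cov)
    return mean + w @ chol_target.T

# Hypothetical summary statistics from one data provider (two predictors).
shared_mean = np.array([54.2, 27.1])
shared_cov = np.array([[120.0, 15.0],
                       [15.0, 16.0]])
x = pseudo_data(shared_mean, shared_cov, n=200)
```

In this sketch the analyst would call `pseudo_data` once per data provider and stack the results with a provider identifier before fitting a mixed model. Matching only first and second moments is a simplifying assumption made for illustration.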