Privacy constraints have driven the rise of federated learning (FL), which enables multi-site analyses without sharing individual participant data. Existing FL estimators largely assume complete data, whereas multi-site studies often face missingness. We develop a framework for FL with missing data and identify conditions under which the complete case (CC) estimator is preferred over the inverse probability weighting (IPW) estimator. For settings where the CC estimator is biased, we introduce a calibrated weight estimation approach that combines candidate weighting models across sites and remains consistent if at least one candidate is correctly specified at each site. We further show that pooling many candidate weighting models with redundant information degrades the calibrated estimator, so a small set is preferable. Consistency conditions are stated at the site level, so the federated estimator inherits validity from site-level properties. We prove consistency and derive a sandwich variance estimator that accounts for uncertainty in the outcome model, the estimated weighting models, and the calibration step. We also show that all estimators require only one or a few communication rounds, making them practical under real-world data-governance constraints. We illustrate the framework by evaluating risk factors for 90-day mortality among patients with pleural infections treated with intrapleural enzyme therapy.