Surgical Aggregation: A Federated Learning Framework for Harmonizing Distributed Datasets with Diverse Tasks

Many large-scale chest x-ray datasets have been curated for detecting abnormalities with deep learning, with the potential to provide substantial benefits across many clinical applications. However, each of these datasets covers only a subset of the disease labels that could be present, which limits its clinical utility. Furthermore, the distributed nature of these datasets, along with data sharing regulations, makes it difficult to pool them into a complete representation of disease labels. To that end, we propose surgical aggregation, a federated learning framework for aggregating and harmonizing knowledge from distributed datasets with different disease labels into a 'global' deep learning model. We used surgical aggregation to harmonize the NIH (14 labels) and CheXpert (13 labels) datasets into a global model able to predict all 20 unique disease labels, and compared it with 'baseline' models trained individually on each dataset. The global model performed well on held-out test sets from NIH and CheXpert, with average AUROCs of 0.75 and 0.74, respectively, compared with baseline average AUROCs of 0.81 and 0.71. On the MIMIC external test set, the global model generalized better, with an average AUROC of 0.80 compared with 0.74 and 0.76 for the NIH and CheXpert baseline models, respectively. Our results show that surgical aggregation has the potential to yield clinically useful deep learning models by aggregating knowledge from distributed datasets with diverse tasks, a step towards bridging the gap from bench to bedside.
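As a rough illustration of what harmonizing datasets with different label sets can involve, the sketch below builds the union of labels across nodes and merges per-label classifier weights. Note that 14 and 13 labels combining into 20 unique labels implies 7 labels shared by both datasets. Everything else here is an assumption made for illustration: the abstract does not specify the aggregation rule, so the element-wise averaging, the function names `harmonize_labels` and `aggregate_heads`, and the toy weights are hypothetical and should not be read as the authors' implementation.

```python
# Illustrative sketch only. The averaging rule, function names, and toy
# weights are assumptions; the abstract does not describe the actual method.

def harmonize_labels(*label_sets):
    """Return the sorted union of the labels held by each node."""
    return sorted(set().union(*label_sets))


def aggregate_heads(node_heads, global_labels):
    """Merge per-label classifier weights from several nodes.

    node_heads: list of dicts mapping label -> list of float weights.
    A label trained by one node keeps that node's weights unchanged;
    a label shared by several nodes is averaged element-wise.
    """
    global_head = {}
    for label in global_labels:
        rows = [head[label] for head in node_heads if label in head]
        global_head[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return global_head


# Hypothetical toy nodes with partially overlapping label sets.
node_a = {"Atelectasis": [0.25, 0.5], "Hernia": [1.0, -1.0]}
node_b = {"Atelectasis": [0.75, 0.0], "Fracture": [0.5, 0.5]}

labels = harmonize_labels(node_a, node_b)
global_head = aggregate_heads([node_a, node_b], labels)

# Overlap implied by the reported label counts: 14 + 13 - 20 shared labels.
shared_label_count = 14 + 13 - 20
```

In this toy setup the shared label is averaged across both nodes, while labels unique to one node pass through untouched, so the global head covers the full label union without either node seeing the other's data.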
