Federated Rule Ensemble Method in Medical Data

Machine learning has become integral to medical research and is increasingly applied in clinical settings to support diagnosis and decision-making. Its effectiveness, however, depends on access to large, diverse datasets, which are rarely available within a single institution. Integrating data across institutions can address this limitation, but privacy regulations and data ownership constraints hinder such efforts. Federated learning enables collaborative model training without sharing raw data, yet most methods rely on complex architectures that lack interpretability, which limits their clinical applicability. We therefore propose a federated RuleFit framework that constructs a unified, interpretable global model in distributed environments. It integrates three components: (1) preprocessing based on differentially private histograms to estimate shared cutoff values, which enables consistent rule definitions and reduces heterogeneity across clients; (2) local rule generation using gradient boosting decision trees with the shared cutoffs; and (3) coefficient estimation via $\ell_1$-regularized optimization using a Federated Dual Averaging algorithm for sparse and consistent variable selection. In simulation studies, the proposed method achieved performance comparable to that of centralized RuleFit while outperforming existing federated approaches. A real-world analysis demonstrated its ability to provide interpretable insights with competitive predictive accuracy. The proposed framework thus offers a practical and effective solution for interpretable and reliable modeling in federated learning environments.
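
To make the first and third components more concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `dp_shared_cutoffs` and `fed_dual_avg_l1`, the Laplace mechanism on per-client bin counts, the quantile-based choice of cutoffs, the squared-error loss, and all step sizes and round counts. The tree-based rule generation step is omitted; the design matrices passed to `fed_dual_avg_l1` stand in for whatever rule and linear features that step would produce.

```python
import numpy as np


def dp_shared_cutoffs(client_values, edges, epsilon, n_cutoffs, rng):
    """Estimate shared cutoffs for one feature from noisy client histograms.

    Assumption: each client perturbs its bin counts with Laplace noise of
    scale 1/epsilon (L1 sensitivity 1 when one record is added or removed),
    and the server takes evenly spaced quantiles of the aggregated histogram.
    """
    noisy_total = np.zeros(len(edges) - 1)
    for values in client_values:
        counts, _ = np.histogram(values, bins=edges)
        # Only the noisy counts would leave the client.
        noisy_total += counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Clipping negative counts is post-processing and does not affect privacy.
    noisy_total = np.clip(noisy_total, 0.0, None)
    cdf = np.cumsum(noisy_total) / noisy_total.sum()
    probs = np.arange(1, n_cutoffs + 1) / (n_cutoffs + 1)
    idx = np.minimum(np.searchsorted(cdf, probs), len(edges) - 2)
    # Report the upper edge of the bin in which each quantile is reached.
    return edges[1:][idx]


def soft_threshold(z, tau):
    """Proximal operator of tau * ||w||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def fed_dual_avg_l1(clients, lam, lr=0.05, rounds=50, local_steps=5):
    """Simplified federated dual averaging for an l1-penalized least-squares fit.

    clients: list of (X, y) pairs, one per client. Clients update a dual
    state z with local gradients; the server averages dual states, not
    primal weights, so coefficients thresholded to zero stay exactly zero.
    """
    d = clients[0][0].shape[1]
    z = np.zeros(d)  # shared dual state (negative scaled gradient sum)
    for r in range(rounds):
        local_states = []
        for X, y in clients:
            z_k = z.copy()
            for k in range(local_steps):
                t = r * local_steps + k  # gradients accumulated so far
                # The l1 penalty accumulates with the step count.
                w = soft_threshold(z_k, lr * lam * t)
                grad = X.T @ (X @ w - y) / len(y)
                z_k -= lr * grad
            local_states.append(z_k)
        z = np.mean(local_states, axis=0)  # server aggregation in dual space
    return soft_threshold(z, lr * lam * rounds * local_steps)
```

Averaging in the dual space is the design choice that matters here: averaging soft-thresholded primal weights across clients tends to fill in coefficients that individual clients had set to zero, whereas applying the threshold after averaging the dual states keeps the global coefficient vector sparse. The privacy accounting in this sketch covers a single feature and a single release per client; a complete treatment would need to compose the budget across features.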
