Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike the models that most federated learning approaches target, RF are piecewise constant, which prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, which aggregates carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
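To make the idea of choosing a split from aggregated client statistics concrete, the following is a minimal sketch and not the FedForest procedure itself, since the abstract does not say which statistics are exchanged. It assumes a regression task with squared-error impurity, a single feature, and candidate thresholds (`edges`) agreed on by all clients in advance. The names `client_stats` and `best_split` are hypothetical. Each client sends only per-bin counts, sums, and sums of squares, and the server scores every threshold from their totals.

```python
import numpy as np

def client_stats(x, y, edges):
    """Per-bin count, sum of y, and sum of y^2 for one client's feature column.

    Assumption: all clients share the same candidate thresholds `edges`.
    Bin j holds the points with edges[j-1] < x <= edges[j].
    """
    idx = np.searchsorted(edges, x, side="left")
    n_bins = len(edges) + 1
    cnt = np.bincount(idx, minlength=n_bins).astype(float)
    s1 = np.bincount(idx, weights=y, minlength=n_bins)
    s2 = np.bincount(idx, weights=y * y, minlength=n_bins)
    return np.stack([cnt, s1, s2])

def best_split(stats, edges):
    """Return the threshold minimizing the summed squared error of both children."""
    cnt, s1, s2 = np.cumsum(stats, axis=1)  # prefix sums: points with x <= edges[j]
    n, t1, t2 = cnt[-1], s1[-1], s2[-1]
    best_edge, best_sse = None, np.inf
    for j, edge in enumerate(edges):
        n_left, n_right = cnt[j], n - cnt[j]
        if n_left == 0 or n_right == 0:
            continue  # skip thresholds that leave one child empty
        sse_left = s2[j] - s1[j] ** 2 / n_left
        sse_right = (t2 - s2[j]) - (t1 - s1[j]) ** 2 / n_right
        if sse_left + sse_right < best_sse:
            best_edge, best_sse = edge, sse_left + sse_right
    return best_edge, best_sse

# Synthetic clients with covariate shift (illustrative data, not from the paper).
rng = np.random.default_rng(0)
edges = np.linspace(-2.0, 4.0, 25)
clients = []
for shift in (0.0, 1.0, 2.0):
    x = rng.normal(loc=shift, size=200)
    y = (x > 1.0).astype(float) + 0.1 * rng.normal(size=200)
    clients.append((x, y))

# Federated: the server only sees the summed per-bin statistics.
global_stats = sum(client_stats(x, y, edges) for x, y in clients)
fed_edge, fed_sse = best_split(global_stats, edges)

# Centralized reference: the same computation on the pooled data.
x_all = np.concatenate([x for x, _ in clients])
y_all = np.concatenate([y for _, y in clients])
cen_edge, cen_sse = best_split(client_stats(x_all, y_all, edges), edges)

print(fed_edge, cen_edge)
```

Because these statistics are additive across clients, the federated and centralized searches score every candidate threshold identically up to floating-point rounding. The approximation to an unbinned centralized split then depends only on how fine the shared grid is, which is an artifact of this sketch's design rather than a statement about the paper's guarantee.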