Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike the models targeted by most federated learning approaches, RFs are piecewise constant, which precludes exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
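To make the split-selection idea concrete, here is a minimal sketch of federated split scoring via aggregated client statistics. This is an illustrative simplification, not the paper's actual procedure: the function names, the fixed candidate thresholds, and the choice of Gini impurity for binary labels are our assumptions. Each client sends only per-threshold counts and label sums; the server aggregates them and scores candidate splits on the pooled statistics.

```python
import numpy as np

def client_split_stats(X, y, feature, thresholds):
    """Per-client sufficient statistics for each candidate threshold:
    (left count, left label sum, right count, right label sum).
    Assumes binary labels y in {0, 1}. (Hypothetical helper.)"""
    stats = []
    for t in thresholds:
        left = X[:, feature] <= t
        stats.append((left.sum(), y[left].sum(), (~left).sum(), y[~left].sum()))
    return np.array(stats, dtype=float)

def server_best_split(per_client_stats, thresholds):
    """Aggregate client statistics and return the threshold minimizing
    the size-weighted Gini impurity of the induced split."""
    agg = sum(per_client_stats)  # element-wise sum over all clients
    best_t, best_imp = None, np.inf
    for (nl, sl, nr, sr), t in zip(agg, thresholds):
        if nl == 0 or nr == 0:  # degenerate split: skip
            continue
        pl, pr = sl / nl, sr / nr  # class-1 proportions on each side
        gini = nl * 2 * pl * (1 - pl) + nr * 2 * pr * (1 - pr)
        if gini < best_imp:
            best_imp, best_t = gini, t
    return best_t
```

For a fixed threshold grid, counts and label sums are sufficient statistics for the Gini criterion, so the server recovers exactly the split a centralized algorithm would choose on the pooled data; the approximation gap discussed in the abstract arises from how candidate thresholds and statistics are chosen in the full method.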