We develop Clustered Random Forests, a random forests algorithm for clustered data arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions of each decision tree in a clustered random forest take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimality of a clustered random forest, in the sense of which weights within this framework are optimal (i.e. minimise mean squared prediction error), varies under covariate distribution shift. In light of this, we advocate that weight estimation be determined by a user-chosen covariate distribution, or test dataset of covariates, with respect to which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regard to the optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
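As an illustration of the weighted least squares form of the leaf-wise predictions, the following is a minimal sketch for a single tree whose leaf assignments are already given. It is not the authors' implementation: the function name `clustered_leaf_predictions`, the exchangeable (equicorrelated) working correlation with a single parameter `rho`, and the dense matrix inverse are all assumptions made for illustration, and the sketch does not attempt the approximately linear time fitting described above.

```python
import numpy as np


def clustered_leaf_predictions(y, leaf, cluster, n_leaves, rho):
    """Weighted least squares leaf predictions for one tree (illustrative).

    Assumptions (not taken from the source): each cluster uses an
    exchangeable working correlation R = (1 - rho) * I + rho * 11^T,
    the weight matrix is W = R^{-1}, and every leaf contains at least
    one observation.
    """
    y = np.asarray(y, dtype=float)
    leaf = np.asarray(leaf)
    cluster = np.asarray(cluster)

    # Accumulate the normal equations A @ beta = b over clusters.
    A = np.zeros((n_leaves, n_leaves))
    b = np.zeros(n_leaves)
    for c in np.unique(cluster):
        idx = np.flatnonzero(cluster == c)
        n = idx.size
        # Z[j, l] = 1 if observation j of this cluster falls in leaf l.
        Z = np.zeros((n, n_leaves))
        Z[np.arange(n), leaf[idx]] = 1.0
        # Hypothetical working correlation and its inverse as weights.
        R = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
        W = np.linalg.inv(R)
        A += Z.T @ W @ Z
        b += Z.T @ W @ y[idx]
    # One predicted value per leaf.
    return np.linalg.solve(A, b)


# Toy data: two clusters of three observations spread over two leaves.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
leaf = [0, 0, 1, 0, 1, 1]
cluster = [0, 0, 0, 1, 1, 1]

independent = clustered_leaf_predictions(y, leaf, cluster, n_leaves=2, rho=0.0)
correlated = clustered_leaf_predictions(y, leaf, cluster, n_leaves=2, rho=0.5)
```

With `rho = 0` the weights are the identity and the predictions reduce to the ordinary leaf means of a standard regression tree; a nonzero `rho` lets observations in the same cluster but different leaves inform one another's predictions.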