We propose a method for transfer learning in nonparametric regression using a random forest (RF) with distance covariance-based feature weights, assuming the unknown source and target regression functions differ only sparsely. Our method obtains residuals in the target domain from a Centered RF (CRF) trained on the source domain, then fits a second CRF to these residuals with feature splitting probabilities proportional to the sample distance covariance between each feature and the residuals. We derive an upper bound on the mean squared error rate of the procedure as a function of the sample sizes and the dimension of the sparse difference, theoretically demonstrating the benefits of transfer learning in random forests. A major difficulty for transfer learning in random forests is the lack of explicit regularization in the method. Our results explain why shallower trees with preferential selection of features lead to both lower bias and lower variance when fitting a low-dimensional function. We show that in the residual random forest, this implicit regularization is enabled by the sample distance covariance. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard RF (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights for RF performance when some features dominate. Our method shows significant gains in predicting the mortality of ICU patients in smaller target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.
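To make the feature-weighting step concrete, here is a minimal sketch (not the authors' implementation) of the naive O(n²) sample distance covariance between each feature and the stage-one residuals, normalized into per-feature split probabilities; the function names `sample_dcov` and `split_probabilities` are ours, introduced for illustration only.

```python
import numpy as np

def sample_dcov(x, y):
    """Naive O(n^2) sample distance covariance between 1-D arrays x and y."""
    a = np.abs(x[:, None] - x[None, :])  # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distances within y
    # Double-center both distance matrices.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # Squared sample distance covariance is the mean of the entrywise product.
    return np.sqrt(max((A * B).mean(), 0.0))

def split_probabilities(X, residuals):
    """Split probabilities proportional to feature-residual distance covariance."""
    w = np.array([sample_dcov(X[:, j], residuals) for j in range(X.shape[1])])
    return w / w.sum()
```

A feature on which the residuals genuinely depend receives a higher splitting probability in the residual forest than an irrelevant one, which is the mechanism behind the implicit regularization described above.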