Since their introduction by Breiman (2001), Random Forests (RFs) have proven useful for both classification and regression tasks. The RF prediction for a previously unseen observation can be represented as a weighted sum of the training observations. This nearest-neighbor-type representation is useful, among other things, for constructing forecast distributions (Meinshausen, 2006). In this paper, we consider simplifying RF-based forecast distributions by sparsifying them: we retain a small subset of nearest neighbors and set the remaining weights to zero. This sparsification step greatly improves the interpretability of RF predictions, and it can be applied to any forecasting task without re-training existing RF models. In empirical experiments, we document that the simplified predictions can match or exceed the original ones in terms of forecasting performance. We explore the statistical sources of this finding via a stylized analytical model of RFs. The model suggests that simplification is particularly promising when the unknown true forecast distribution contains many small weights that are estimated imprecisely.
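The weighted-sum representation and the sparsification step can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the leaf-co-membership weights follow the Meinshausen-style construction, while the top-k cutoff, renormalization, and all variable names are assumptions made for the example.

```python
# Hedged sketch: extract nearest-neighbor weights from a fitted random
# forest and sparsify them by keeping only the k largest weights.
# The top-k rule and k=10 are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)
x_new = rng.normal(size=(1, 3))

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Leaf indices per tree: shape (n_samples, n_trees)
train_leaves = rf.apply(X_train)
new_leaves = rf.apply(x_new)

# Weight of training point i = average over trees of
# 1[i falls in the same leaf as x_new] / (training points in that leaf)
same_leaf = train_leaves == new_leaves            # (n_train, n_trees) boolean
leaf_sizes = same_leaf.sum(axis=0)                # training points per matching leaf
weights = (same_leaf / leaf_sizes).mean(axis=1)   # nonnegative, sums to 1

# Sparsify: keep the k largest weights, zero the rest, renormalize
k = 10
sparse = np.zeros_like(weights)
top = np.argsort(weights)[-k:]
sparse[top] = weights[top]
sparse /= sparse.sum()

full_pred = weights @ y_train    # weighted-sum representation of the RF forecast
sparse_pred = sparse @ y_train   # simplified forecast from at most k neighbors
```

The full weight vector also defines a forecast distribution (a point mass of `weights[i]` on each `y_train[i]`); sparsification replaces it with a distribution supported on at most `k` training outcomes, which is what makes the prediction easy to inspect.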