Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
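For orientation, the traditional definition that the abstract contrasts with RF-GAP is usually stated as the fraction of trees in which two points fall into the same terminal node. The block below is a minimal sketch of that classical proximity only; it is not RF-GAP, whose definition is not given in this abstract. The input layout `leaves[i][t]` (the leaf reached by point `i` in tree `t`) and the toy values are assumptions made for illustration. In practice such a matrix could be obtained from a fitted forest, for example with scikit-learn's `RandomForestClassifier.apply`.

```python
from typing import List, Sequence


def classical_proximities(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Fraction of trees in which each pair of points shares a terminal node.

    leaves[i][t] is the leaf id that point i reaches in tree t
    (an assumed input layout, one row per point, one column per tree).
    """
    n = len(leaves)
    if n == 0:
        return []
    n_trees = len(leaves[0])
    if n_trees == 0:
        raise ValueError("at least one tree is required")

    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count the trees where points i and j land in the same leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Toy leaf assignments: 4 points, 3 trees (made-up values for illustration).
toy_leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [1, 2, 3],
    [1, 0, 0],
]
prox = classical_proximities(toy_leaves)
```

This version uses every tree for every pair, regardless of whether a point was in-bag or out-of-bag for that tree. The abstract's claim is that definitions of this kind do not reproduce the forest's out-of-bag predictions, which is the gap RF-GAP is designed to close.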