Random Forests have become a widely used tool in machine learning since their introduction in 2001, known for their strong performance in classification and regression tasks. One key feature of Random Forests is the Random Forest Permutation Importance Measure (RFPIM), an internal, non-parametric measure of variable importance. Although widely used, RFPIM has received little theoretical attention, and most research has focused on empirical findings. Recent progress includes establishing the consistency of RFPIM, but a mathematical analysis of its asymptotic distribution is still missing. In this paper, we provide a formal proof of a Central Limit Theorem for RFPIM using the theory of U-statistics. Our approach deviates from the conventional Random Forest model by assuming a random number of trees and by imposing conditions on the regression function and the error terms, which must be bounded and additive, respectively. Our result aims to improve the theoretical understanding of RFPIM rather than to support comprehensive hypothesis testing. Nevertheless, our contributions provide a solid foundation and demonstrate the potential for extensions to practical applications, which we also illustrate with a small simulation study.
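For readers unfamiliar with the quantity under study, the RFPIM discussed above can be computed in practice, for example with scikit-learn's `permutation_importance`. The following is a minimal illustrative sketch on synthetic data with a bounded, additive regression function (as assumed in the paper); it is not the paper's own simulation code, and all data and parameter choices are hypothetical:

```python
# Minimal sketch of Random Forest permutation importance (RFPIM-style)
# using scikit-learn. Synthetic data, illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Bounded, additive regression function in features 0 and 1;
# feature 2 is pure noise and carries no signal.
y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1]) + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance: mean drop in score when each feature is shuffled.
result = permutation_importance(forest, X, y, n_repeats=20, random_state=0)
print(result.importances_mean)  # importance of the noise feature should be smallest
```

The informative features (0 and 1) should receive clearly larger importance values than the noise feature (2), which is the kind of ranking behavior whose sampling distribution the paper's Central Limit Theorem describes.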