Statistical wisdom suggests that very complex models that interpolate the training data will be poor at predicting unseen examples. Yet this aphorism has recently been challenged by the identification of benign overfitting regimes, studied especially in the case of parametric models: generalization capabilities may be preserved despite high model complexity. While it is widely known that fully-grown decision trees interpolate and, in turn, have poor predictive performance, the same behavior is yet to be analyzed for Random Forests (RF). In this paper, we study the trade-off between interpolation and consistency for several types of RF algorithms. Theoretically, we prove that the interpolation regime and consistency cannot be achieved simultaneously for several non-adaptive RF. Since adaptivity seems to be the cornerstone for bringing together interpolation and consistency, we study interpolating Median RF, which are proved to be consistent in the interpolating regime. This is the first result reconciling interpolation and consistency for RF, highlighting that the averaging effect introduced by feature randomization is a key mechanism, sufficient to ensure consistency in the interpolation regime and beyond. Numerical experiments show that Breiman's RF are consistent while exactly interpolating when no bootstrap step is involved. We theoretically control the size of the interpolation area, which converges fast enough to zero, giving a necessary condition for exact interpolation and consistency to occur in conjunction.
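To make the notion of an interpolating forest concrete, the following is a minimal sketch, not the construction analyzed in the paper. It assumes that each tree is grown on the full sample (no bootstrap or subsampling), that each split cuts at the empirical median of a uniformly chosen coordinate, and that trees are grown until every leaf holds a single training point. The function names (`build_median_tree`, `predict_forest`, `demo`) and the synthetic regression function are hypothetical and chosen only for illustration.

```python
import random


def build_median_tree(points, rng):
    """Grow a tree until each leaf contains exactly one training point.

    points: list of (x, y) pairs, where x is a tuple of floats.
    Assumption: coordinate values are distinct (continuous features),
    so a median cut always separates the points into two non-empty cells.
    """
    if len(points) == 1:
        return ("leaf", points[0][1])
    # Feature randomization: pick the split coordinate uniformly at random.
    j = rng.randrange(len(points[0][0]))
    ordered = sorted(points, key=lambda p: p[0][j])
    mid = len(ordered) // 2
    # Cut at the (upper) empirical median of coordinate j; the split
    # position depends on the inputs only, never on the labels.
    threshold = ordered[mid][0][j]
    return ("node", j, threshold,
            build_median_tree(ordered[:mid], rng),
            build_median_tree(ordered[mid:], rng))


def predict_tree(tree, x):
    """Route x down the tree and return the label stored in its leaf."""
    while tree[0] == "node":
        _, j, threshold, left, right = tree
        tree = left if x[j] < threshold else right
    return tree[1]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


def demo(n=200, n_trees=50, seed=0):
    """Fit a small forest on noisy synthetic data and measure the
    largest gap between forest predictions and training labels."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        # Hypothetical regression function plus Gaussian noise.
        y = x[0] + x[1] ** 2 + rng.gauss(0.0, 0.3)
        data.append((x, y))
    forest = [build_median_tree(data, rng) for _ in range(n_trees)]
    max_train_gap = max(abs(predict_forest(forest, x) - y) for x, y in data)
    return forest, data, max_train_gap
```

Because every tree returns the exact label at each training point, their average does too, so `max_train_gap` is zero up to floating-point rounding even though the labels are noisy. Away from the training points the trees disagree, and averaging over the randomized splits smooths the prediction; this is the averaging effect the abstract refers to, shown here only qualitatively.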