This work studies the statistical implications of using features composed of general linear combinations of covariates to partition the data in randomized decision tree and forest regression algorithms. Using random tessellation theory from stochastic geometry, we provide a theoretical analysis of a class of efficiently generated random tree and forest estimators that allow for oblique splits along such features. We call these estimators oblique Mondrian trees and forests, as the trees are generated by first selecting a set of features from linear combinations of the covariates and then running a Mondrian process that hierarchically partitions the data along these features. Quadratic risk bounds and convergence rates are obtained for the flexible function class of multi-index models used in dimension reduction, where the output is assumed to depend on a low-dimensional relevant feature subspace of the input domain. The results highlight how the risk of these estimators depends on the choice of features and quantify how robust the risk is to error between the selected features along which the data is split and the true relevant feature subspace. The asymptotic analysis also provides conditions on the convergence rate that a set of estimated relevant features must satisfy for oblique Mondrian estimators to attain minimax optimal rates of convergence with respect to the dimension of the relevant feature subspace. Additionally, a lower bound on the risk of axis-aligned Mondrian trees (where features are restricted to the set of covariates) is obtained, proving that these estimators are suboptimal for general ridge functions, no matter how the distribution over the covariates used to divide the data at each tree node is weighted.
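The two-step construction described above (select features from linear combinations of the covariates, then run a Mondrian process along them) can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: the features are the columns of a user-supplied matrix `A`, the data is projected onto them, and a standard axis-aligned Mondrian process with lifetime `lifetime` is run on the bounding box of the projected points. The function names (`sample_mondrian`, `cell_index`, `oblique_mondrian_tree_predict`, `oblique_mondrian_forest_predict`) are hypothetical, and the paper's exact construction may differ.

```python
import numpy as np


def sample_mondrian(lower, upper, lifetime, rng, t=0.0):
    """Sample a Mondrian partition of the box [lower, upper]; return its leaf cells."""
    lengths = upper - lower
    rate = lengths.sum()  # a cell splits at a rate equal to the sum of its side lengths
    if rate <= 0:
        return [(lower, upper)]
    t_split = t + rng.exponential(1.0 / rate)
    if t_split > lifetime:  # lifetime budget exhausted: the cell becomes a leaf
        return [(lower, upper)]
    j = rng.choice(len(lengths), p=lengths / rate)  # coordinate chosen prop. to side length
    s = rng.uniform(lower[j], upper[j])             # split location uniform along that side
    left_upper, right_lower = upper.copy(), lower.copy()
    left_upper[j] = s
    right_lower[j] = s
    return (sample_mondrian(lower, left_upper, lifetime, rng, t_split)
            + sample_mondrian(right_lower, upper, lifetime, rng, t_split))


def cell_index(Z, cells):
    """Assign each row of Z to the first leaf cell that contains it."""
    idx = np.full(len(Z), -1)
    for k, (lo, hi) in enumerate(cells):
        inside = np.all((Z >= lo) & (Z <= hi), axis=1) & (idx == -1)
        idx[inside] = k
    return idx


def oblique_mondrian_tree_predict(X_train, y_train, X_test, A, lifetime, rng):
    """One tree: project onto the columns of A (assumed features), partition, average."""
    Z_train, Z_test = X_train @ A, X_test @ A
    Z_all = np.vstack([Z_train, Z_test])
    cells = sample_mondrian(Z_all.min(axis=0), Z_all.max(axis=0), lifetime, rng)
    train_idx, test_idx = cell_index(Z_train, cells), cell_index(Z_test, cells)
    pred = np.full(len(X_test), y_train.mean())  # fallback for cells with no training point
    for k in np.unique(test_idx):
        in_cell = train_idx == k
        if in_cell.any():
            pred[test_idx == k] = y_train[in_cell].mean()
    return pred


def oblique_mondrian_forest_predict(X_train, y_train, X_test, A, lifetime, n_trees, rng):
    """Forest: average the predictions of independently sampled trees."""
    trees = [oblique_mondrian_tree_predict(X_train, y_train, X_test, A, lifetime, rng)
             for _ in range(n_trees)]
    return np.mean(trees, axis=0)
```

For a ridge function such as `y = g(a @ x)`, passing `A = a.reshape(-1, 1)` makes every split perpendicular to the relevant direction, whereas taking `A` to be the identity matrix recovers an axis-aligned Mondrian tree in this sketch. Perturbing `A` away from the true direction gives a simple way to explore empirically the robustness to feature error that the abstract refers to.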