Random forests (RFs) are widely used for prediction and variable importance analysis and are often believed to capture any type of interaction via recursive splitting. However, because each split is chosen locally, interactions are reliably captured only when at least one of the involved covariates has a marginal effect. We introduce unity forests (UFOs), an RF variant designed to better exploit interactions involving covariates without marginal effects. In UFOs, the first few splits of each tree are optimized jointly across a random covariate subset to form a "tree root" that captures such interactions; the remainder of the tree is grown conventionally. We further propose the unity variable importance measure (VIM), which is based on out-of-bag split criterion values from the tree roots. For each covariate, only the small fraction of tree root splits with the highest in-bag criterion values is considered, reflecting the fact that covariates with purely interaction-based effects are discriminative only if a split on an interacting covariate occurred earlier in the tree. Finally, we introduce covariate-representative tree roots (CRTRs), which select representative tree roots per covariate and provide interpretable insight into the conditions (marginal or interactive) under which each covariate has its strongest effects. In a simulation study, the unity VIM reliably identified interacting covariates without marginal effects, unlike conventional RF-based VIMs. In a large-scale real-data comparison, UFOs achieved higher discrimination and predictive accuracy than standard RFs, with comparable calibration. The CRTRs reliably reproduced the covariates' true effect types in simulated data and yielded informative insights in a real-data analysis.
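The limitation of locally chosen splits can be seen in a toy example. The sketch below is not the UFO algorithm; it only assumes binary covariates, a Gini-based split criterion, and a pure XOR interaction, and the helper names (`gini`, `split_gain`, `joint_gain`) are hypothetical. It shows that every single split has zero impurity decrease, whereas evaluating two successive splits jointly reveals the interacting pair.

```python
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gain(rows, labels, j):
    """Impurity decrease from a single split on binary covariate j."""
    left = [y for x, y in zip(rows, labels) if x[j] == 0]
    right = [y for x, y in zip(rows, labels) if x[j] == 1]
    n = len(labels)
    child = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - child


def joint_gain(rows, labels, j, k):
    """Impurity decrease from splitting on j, then on k in both children."""
    n = len(labels)
    child = 0.0
    for a, b in product((0, 1), repeat=2):
        cell = [y for x, y in zip(rows, labels) if x[j] == a and x[k] == b]
        child += len(cell) * gini(cell) / n
    return gini(labels) - child


# Balanced toy data (an assumption for illustration):
# covariates 0 and 1 interact via XOR, covariate 2 is pure noise.
rows = [combo for combo in product((0, 1), repeat=3) for _ in range(25)]
labels = [x[0] ^ x[1] for x in rows]

# Greedy view: gain of each covariate considered on its own.
marginal = [split_gain(rows, labels, j) for j in range(3)]

# Joint view: gain of each covariate pair over two successive splits.
joint = {
    (j, k): joint_gain(rows, labels, j, k)
    for j in range(3)
    for k in range(j + 1, 3)
}

print("marginal gains:", marginal)
print("joint gains:", joint)
```

In this setting a greedy first split cannot distinguish the interacting covariates from the noise covariate, while the joint criterion is maximal only for the pair (0, 1). This is the kind of structure that jointly optimized tree roots are meant to pick up.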