Random forest is a popular machine learning approach for the analysis of high-dimensional data because it is flexible and provides variable importance measures for the selection of relevant features. However, the complex relationships between the features are usually not considered in the selection and are thus also neglected in the characterization of the analysed samples. Here we propose two novel approaches that focus on the mutual impact of features in random forests. Mutual forest impact (MFI) is a relation parameter that evaluates the mutual association of the features with the outcome and hence goes beyond the analysis of correlation coefficients. Mutual impurity reduction (MIR) is an importance measure that combines this relation parameter with the importance of the individual features. MIR and MFI are implemented together with testing procedures that generate p-values for the selection of related and important features. Applications to various simulated data sets and comparisons with other methods for feature selection and relation analysis show that MFI and MIR are promising tools for shedding light on the complex relationships between features and outcome. In addition, they are not affected by common biases, such as the preference for features with many possible splits or high minor allele frequencies.