Random Forest (RF) is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing is a well-known technique for dealing with the class imbalance problem, but its effect on RF variable importance has not been actively studied. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class-imbalanced situations with small sample sizes, while under-sampling fails to differentiate important variables from non-informative ones. We then propose a variable selection algorithm that uses RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that the proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problems.
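To make the two class balancing strategies named in the abstract concrete, the following is a minimal sketch of random over-sampling and random under-sampling in plain Python. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not say which over-sampling or under-sampling variant was studied, and the function name `rebalance`, its `mode` and `seed` parameters, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def rebalance(X, y, mode="over", seed=0):
    """Return (X, y) with every class resampled to the same size.

    Hypothetical helper for illustration only.
    mode="over":  add random duplicates until each class matches the largest class.
    mode="under": draw without replacement until each class matches the smallest class.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    sizes = [len(rows) for rows in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if mode == "over":
            # Keep every original row, then pad with random duplicates.
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            picked = rows + extra
        else:
            # Keep a random subset of the original rows.
            picked = rng.sample(rows, target)
        X_out.extend(picked)
        y_out.extend([label] * target)
    return X_out, y_out


# Toy imbalanced data (assumed): 8 majority rows (class 0), 2 minority rows (class 1).
X = [[i, i % 3] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = rebalance(X, y, mode="over")
X_under, y_under = rebalance(X, y, mode="under")
```

In this toy case, over-sampling grows the data to 16 rows by duplicating minority rows, while under-sampling keeps only 4 of the original 10 rows. Either balanced set could then be passed to an RF implementation to compare the resulting variable importance scores.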