The discretization of continuous numerical attributes remains a persistent computational bottleneck in decision tree induction, particularly as datasets grow in size and dimensionality. Building on the recently proposed MSD-Splitting technique, which bins continuous data using the empirical mean and standard deviation to improve the efficiency and accuracy of the C4.5 algorithm, we introduce Adaptive MSD-Splitting (AMSD). Standard MSD-Splitting is highly effective for approximately symmetric distributions, but its fixed one-standard-deviation cutoffs can cause severe information loss on highly skewed data, which is common in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. We further integrate AMSD into ensemble methods through the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets show that AMSD improves accuracy by 2-4% over standard MSD-Splitting while preserving nearly the same reduction in time complexity, from O(N log N) for exhaustive search to O(N). Our Random Forest extension achieves state-of-the-art accuracy at a fraction of the standard computational cost, supporting the viability of adaptive statistical binning in large-scale ensemble learning architectures.
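To make the binning idea concrete, the following is a minimal sketch of skewness-adjusted cut points, not the implementation evaluated here. It assumes three cut points per feature (mean minus k standard deviations, the mean, and mean plus k standard deviations), and it assumes a hypothetical shrink rule k = base_k / (1 + alpha * |skewness|) applied on the denser side of the distribution. The abstract does not specify the actual adjustment rule, so the function names, the parameter `alpha`, and the rule itself are illustrative only.

```python
from bisect import bisect_right
from statistics import fmean, pstdev


def sample_skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        return 0.0
    return fmean([((v - mu) / sigma) ** 3 for v in values])


def msd_cutpoints(values, k=1.0):
    """Fixed cut points at mean - k*sigma, mean, mean + k*sigma (assumed layout)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    return [mu - k * sigma, mu, mu + k * sigma]


def amsd_cutpoints(values, base_k=1.0, alpha=0.5):
    """Skewness-adjusted cut points under a hypothetical shrink rule.

    For right-skewed data most of the mass lies below the mean, so the
    lower interval is narrowed; for left-skewed data the upper one is.
    """
    mu = fmean(values)
    sigma = pstdev(values, mu)
    g = sample_skewness(values)
    shrunk_k = base_k / (1.0 + alpha * abs(g))  # assumed rule, not from the source
    k_low = shrunk_k if g > 0 else base_k
    k_high = shrunk_k if g < 0 else base_k
    return [mu - k_low * sigma, mu, mu + k_high * sigma]


def discretize(values, cuts):
    """Map each value to a bin index in 0..len(cuts) using sorted cut points."""
    return [bisect_right(cuts, v) for v in values]
```

Every step is a constant number of linear passes over the feature column, with no sorting of the data, which is consistent with the O(N) cost discussed above. On symmetric data the sketch reduces to the fixed cut points, since the skewness is zero and no shrinking is applied.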