In machine learning, the exponential growth of data and the associated ``curse of dimensionality'' pose significant challenges, particularly for large yet sparse datasets. Multi-view ensemble learning (MEL) addresses these challenges, and feature partitioning (FP) plays a pivotal role in constructing the artificial views that MEL requires. We introduce the Semantic-Preserving Feature Partitioning (SPFP) algorithm, a novel FP method grounded in information theory that partitions a dataset into multiple semantically consistent views. We evaluate SPFP in extensive experiments on eight real-world datasets, ranging from high-dimensional data with few instances to low-dimensional data with many instances. Where high generalization performance is achievable, SPFP maintains model accuracy while significantly improving uncertainty measures; where it is less attainable, SPFP maintains uncertainty metrics while improving accuracy. An effect size analysis further shows that SPFP outperforms benchmark models with large effect sizes in most experiments, and that it reduces computational demands through effective dimensionality reduction.
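To make the idea of information-theoretic feature partitioning concrete, the following is a minimal sketch and not the SPFP algorithm itself, whose details are not given here. It assumes discrete features, scores each feature by its empirical mutual information with the label, and deals the ranked features round-robin into views so that each view receives a share of the informative features. The function names `mutual_information` and `partition_features`, the round-robin rule, and the toy data are all illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def partition_features(rows, labels, n_views):
    """Rank features by I(feature; label) and deal them round-robin into views.

    Hypothetical illustration only; this is NOT the SPFP algorithm.
    Returns (views, scores), where views is a list of feature-index lists.
    """
    n_features = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    # Most informative features first; ties keep their original order.
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    views = [[] for _ in range(n_views)]
    for rank, j in enumerate(ranked):
        views[rank % n_views].append(j)
    return views, scores


# Toy data (made up): feature 0 equals the label, feature 1 is constant,
# and feature 2 is independent of the label.
rows = [
    (0, 1, 0),
    (0, 1, 1),
    (1, 1, 0),
    (1, 1, 1),
]
labels = [0, 0, 1, 1]
views, scores = partition_features(rows, labels, n_views=2)
```

In an MEL setting, each list in `views` would select the columns used to train one base learner, and the learners' predictions would then be combined. A real partitioning method would also need to handle continuous features and redundancy between features, which this sketch ignores.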