Genetic programming-based feature construction has achieved notable success in recent years as an automated machine learning technique for enhancing learning performance. However, overfitting remains a challenge that limits its broader applicability. To improve generalization, we prove that vicinal risk, estimated through noise perturbation or mixup-based data augmentation, is bounded by the sum of empirical risk and a regularization term: either a finite-difference term or the vicinal Jensen gap. Leveraging this decomposition, we propose an evolutionary feature construction framework that jointly optimizes empirical risk and the vicinal Jensen gap to control overfitting. Since datasets vary in their noise levels, we develop a noise estimation strategy to dynamically adjust the regularization strength. Furthermore, to mitigate manifold intrusion, in which data augmentation may generate unrealistic samples that fall outside the data manifold, we propose a manifold intrusion detection mechanism. Experimental results on 58 datasets demonstrate the effectiveness of Jensen gap minimization compared to other complexity measures. Comparisons with 15 machine learning algorithms further indicate that genetic programming with the proposed overfitting control strategy achieves superior performance.
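To make the decomposition concrete, the following is a minimal NumPy sketch, not the paper's exact formulation, of a fitness value that combines empirical risk with a mixup-based Jensen-gap penalty. The penalty measures how far predictions on mixed inputs deviate from the corresponding mixture of predictions on the original inputs, which is the Jensen gap of the predictor at the mixture point. The function name `mixup_jensen_gap_fitness` and the parameters `reg_strength` and `alpha` are hypothetical illustration choices, not identifiers from the paper.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error as the empirical risk."""
    return np.mean((pred - target) ** 2)


def mixup_jensen_gap_fitness(predict, X, y, reg_strength=0.1, alpha=0.2, seed=None):
    """Illustrative fitness: empirical risk + mixup-based Jensen-gap penalty.

    predict: callable mapping an (n, d) array to (n,) predictions, e.g. a
             model built on GP-constructed features.
    reg_strength: weight of the regularization term (hypothetical name; the
             paper adjusts this dynamically via noise estimation).
    """
    rng = np.random.default_rng(seed)

    # Empirical risk on the original training data.
    emp_risk = mse(predict(X), y)

    # Mixup: convex combinations of random training pairs.
    idx = rng.permutation(len(X))
    lam = rng.beta(alpha, alpha, size=len(X))
    X_mix = lam[:, None] * X + (1 - lam[:, None]) * X[idx]

    # Jensen-gap-style penalty: difference between the prediction at the
    # mixed input and the same mixture of the original predictions.
    pred = predict(X)
    mixed_pred = lam * pred + (1 - lam) * pred[idx]
    jensen_gap = np.mean((predict(X_mix) - mixed_pred) ** 2)

    return emp_risk + reg_strength * jensen_gap
```

As a sanity check, a linear predictor has zero Jensen gap, so its fitness reduces to the empirical risk; increasingly nonlinear (and potentially overfitted) feature constructions incur a larger penalty:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5)
fitness = mixup_jensen_gap_fitness(lambda Z: Z @ np.ones(5), X, y)  # penalty term is 0
```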