In recent years, genetic programming (GP)-based evolutionary feature construction has achieved significant success. However, a primary challenge with evolutionary feature construction is its tendency to overfit the training data, resulting in poor generalization on unseen data. In this research, we draw inspiration from PAC-Bayesian theory and propose using sharpness-aware minimization in function space to discover symbolic features that exhibit robust performance within a smooth loss landscape in the semantic space. By optimizing sharpness in conjunction with cross-validation loss, as well as designing a sharpness reduction layer, the proposed method effectively mitigates the overfitting problem of GP, especially when dealing with a limited number of instances or in the presence of label noise. Experimental results on 58 real-world regression datasets show that our approach outperforms standard GP as well as six state-of-the-art complexity measurement methods for GP in controlling overfitting. Furthermore, the ensemble version of GP with sharpness-aware minimization demonstrates superior performance compared to nine fine-tuned machine learning and symbolic regression algorithms, including XGBoost and LightGBM.
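To make the core idea concrete, the sketch below illustrates one plausible way to score a candidate GP individual by its training loss plus an estimate of sharpness in semantic (function) space: the predictions are perturbed with Gaussian noise and the average increase in loss serves as the sharpness penalty. This is an illustrative sketch under assumed details (the perturbation scale `rho`, the number of Monte Carlo samples, and the penalty weight `alpha` are hypothetical choices), not the paper's exact estimator.

```python
import numpy as np

def sharpness_estimate(predictions, y, rho=0.1, n_samples=10, rng=None):
    """Estimate semantic-space sharpness: the average increase in squared
    loss when the model's predictions are perturbed by Gaussian noise of
    scale rho. A flat loss landscape around the current semantics yields
    a small value; a sharp one yields a large value."""
    rng = np.random.default_rng(0) if rng is None else rng
    base_loss = np.mean((predictions - y) ** 2)
    perturbed = [
        np.mean((predictions + rng.normal(0.0, rho, predictions.shape) - y) ** 2)
        for _ in range(n_samples)
    ]
    return max(float(np.mean(perturbed) - base_loss), 0.0)

def sam_fitness(predictions, y, alpha=1.0):
    """Combined fitness: prediction loss plus a weighted sharpness
    penalty, so selection favors individuals that are both accurate
    and robust to semantic perturbation."""
    loss = float(np.mean((predictions - y) ** 2))
    return loss + alpha * sharpness_estimate(predictions, y)
```

With a squared-error loss, perturbing perfect predictions by noise of scale `rho` raises the expected loss by roughly `rho**2`, so the penalty is strictly positive even at zero training error, which is what lets it discriminate among equally accurate but differently robust individuals.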