In machine learning and deep learning regression tasks, effective feature engineering (FE) is pivotal to model performance. Traditional FE approaches often rely on domain expertise to design features manually for machine learning models. In deep learning models, FE is embedded in the neural network's architecture, which makes it hard to interpret. In this study, we propose integrating symbolic regression (SR) as an FE step that precedes the machine learning model in order to improve its performance. Through extensive experiments on synthetic and real-world physics-related datasets, we show that incorporating SR-derived features significantly enhances the predictive capabilities of both machine learning and deep learning regression models, with a 34-86% improvement in root mean square error (RMSE) on synthetic datasets and a 4-11.5% improvement on real-world datasets. In addition, as a realistic use case, we show that the proposed method improves machine learning performance in predicting superconducting critical temperatures based on Eliashberg theory by more than 20% in terms of RMSE. These results outline the potential of SR as an FE component in data-driven models.
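To make the idea of SR as an FE step concrete, the following is a minimal sketch and not the pipeline used in the study. It assumes a toy brute-force search over a tiny expression grammar as a stand-in for a real SR engine, a synthetic target invented here for illustration, and an ordinary least-squares model as the downstream regressor. All names (`toy_symbolic_search`, `fit_predict`, `rmse`) are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target assumed for illustration only: y = x0 * x1 / x2 + noise.
X = rng.uniform(1.0, 3.0, size=(400, 3))
y = X[:, 0] * X[:, 1] / X[:, 2] + rng.normal(0.0, 0.01, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Tiny grammar: expressions of the form (x_i op1 x_j) op2 x_k.
OPS = {"*": np.multiply, "/": np.divide, "+": np.add, "-": np.subtract}


def toy_symbolic_search(X, y):
    """Toy stand-in for SR: return the candidate expression whose values
    have the highest absolute Pearson correlation with y."""
    best_score, best_name, best_fn = -1.0, None, None
    n = X.shape[1]
    for i, j, k in itertools.product(range(n), repeat=3):
        for (s1, op1), (s2, op2) in itertools.product(OPS.items(), repeat=2):
            def fn(Z, i=i, j=j, k=k, op1=op1, op2=op2):
                return op2(op1(Z[:, i], Z[:, j]), Z[:, k])

            values = fn(X)
            if values.std() < 1e-12:  # skip constant candidates such as x0 - x0
                continue
            score = abs(np.corrcoef(values, y)[0, 1])
            if score > best_score:
                best_score = score
                best_name = f"(x{i} {s1} x{j}) {s2} x{k}"
                best_fn = fn
    return best_name, best_fn


def fit_predict(A_train, y_train, A_test):
    """Ordinary least squares with an intercept, used as the downstream model."""
    A_train = np.column_stack([np.ones(len(A_train)), A_train])
    A_test = np.column_stack([np.ones(len(A_test)), A_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Step 1: search for a symbolic feature on the training split only.
sr_name, sr_feature = toy_symbolic_search(X_train, y_train)

# Step 2: append the SR-derived feature to the raw inputs.
X_train_aug = np.column_stack([X_train, sr_feature(X_train)])
X_test_aug = np.column_stack([X_test, sr_feature(X_test)])

# Step 3: compare the same downstream model with and without the SR feature.
rmse_raw = rmse(y_test, fit_predict(X_train, y_train, X_test))
rmse_sr = rmse(y_test, fit_predict(X_train_aug, y_train, X_test_aug))

print("selected SR feature:", sr_name)
print("RMSE, raw features:", rmse_raw)
print("RMSE, raw + SR feature:", rmse_sr)
```

In practice, the exhaustive search would be replaced by a dedicated SR tool and the linear model by whichever regressor is being studied. The structure stays the same: fit the symbolic expression on the training data, evaluate it as an extra input column, and train the downstream model on the augmented inputs.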