Symbolic Regression (SR) aims to discover interpretable equations from observational data, with the potential to reveal underlying principles behind natural phenomena. However, existing approaches often fall into the Pseudo-Equation Trap: producing equations that fit observations well but remain inconsistent with fundamental scientific principles. A key reason is that these approaches are dominated by empirical risk minimization and lack explicit constraints to ensure scientific consistency. To bridge this gap, we propose PG-SR, a prior-guided SR framework built upon a three-stage pipeline consisting of warm-up, evolution, and refinement. Throughout the pipeline, PG-SR introduces a prior constraint checker that explicitly encodes domain priors as executable constraint programs, and employs a Prior Annealing Constrained Evaluation (PACE) mechanism during the evolution stage to progressively steer discovery toward scientifically consistent regions. Theoretically, we prove that PG-SR reduces the Rademacher complexity of the hypothesis space, yielding tighter generalization bounds and establishing a guarantee against pseudo-equations. Experimentally, PG-SR outperforms state-of-the-art baselines across diverse domains and remains robust to varying prior quality, noisy data, and data scarcity.
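To make the idea of "domain priors as executable constraint programs" combined with an annealed evaluation more concrete, the following is a minimal sketch and not the PG-SR implementation. Every name in it (`violation_rate`, `annealed_weight`, `pace_score`, the two example priors) is hypothetical, and the linear ramp of the penalty weight is an assumption made purely for illustration; the abstract does not specify how PACE schedules or combines its terms.

```python
import math
from typing import Callable, Sequence

Equation = Callable[[float], float]
Constraint = Callable[[Equation], bool]  # an executable prior: True if satisfied


def mse(eq: Equation, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical risk: mean squared error on the observations."""
    return sum((eq(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def violation_rate(eq: Equation, constraints: Sequence[Constraint]) -> float:
    """Fraction of prior constraints the candidate equation violates."""
    if not constraints:
        return 0.0
    failed = sum(1 for check in constraints if not check(eq))
    return failed / len(constraints)


def annealed_weight(generation: int, total_generations: int,
                    max_weight: float = 10.0) -> float:
    """Assumed schedule: penalty weight ramps linearly from 0 to max_weight."""
    frac = generation / max(total_generations - 1, 1)
    return max_weight * min(max(frac, 0.0), 1.0)


def pace_score(eq: Equation, xs: Sequence[float], ys: Sequence[float],
               constraints: Sequence[Constraint],
               generation: int, total_generations: int) -> float:
    """Lower is better: data fit plus an annealed prior-violation penalty."""
    weight = annealed_weight(generation, total_generations)
    return mse(eq, xs, ys) + weight * violation_rate(eq, constraints)


# Hypothetical priors for a decay law, checked on a grid wider than the data.
GRID = [0.1 * i for i in range(51)]  # 0.0 .. 5.0


def non_negative(eq: Equation) -> bool:
    return all(eq(x) >= 0.0 for x in GRID)


def non_increasing(eq: Equation) -> bool:
    return all(eq(a) >= eq(b) for a, b in zip(GRID, GRID[1:]))


PRIORS = [non_negative, non_increasing]

# Observations generated from exp(-x) on a narrow range.
XS = [0.0, 0.5, 1.0, 1.5, 2.0]
YS = [math.exp(-x) for x in XS]


def consistent(x: float) -> float:
    return math.exp(-x)


def pseudo(x: float) -> float:
    # Roughly follows the data near x = 0 but becomes negative for larger x.
    return 1 - x + 0.5 * x**2 - 0.15 * x**3


if __name__ == "__main__":
    for gen in (0, 9):
        print(gen,
              pace_score(consistent, XS, YS, PRIORS, gen, 10),
              pace_score(pseudo, XS, YS, PRIORS, gen, 10))
```

In this sketch the early generations rank candidates almost purely by data fit, while later generations penalize the polynomial for violating the non-negativity prior, which is one plausible reading of "progressively steer discovery toward scientifically consistent regions".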