Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite the strong empirical success of GP-based SR, theoretical understanding of why it generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth, while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes to a more rigorous understanding of its generalization properties.
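To make the structure-selection idea concrete, the following is a minimal illustrative sketch, not the bound derived in this work. It counts the ordered expression-tree structures that fit within a size limit and a depth limit, and reports the logarithm of that count, which is the quantity a standard finite-class argument would charge for choosing a structure. The function name `count_structures`, the operator counts, and the convention that a single leaf has depth 0 are assumptions made for illustration only.

```python
from functools import lru_cache
import math


def count_structures(max_size, max_depth, n_binary=4, n_unary=2, n_leaves=3):
    """Count ordered expression-tree structures with at most `max_size` nodes
    and depth at most `max_depth` (a single leaf has depth 0).

    Hypothetical primitive set: `n_binary` binary operators, `n_unary` unary
    operators, and `n_leaves` leaf symbols (variables or a constant slot).
    """

    @lru_cache(maxsize=None)
    def exact(size, depth):
        # Trees with exactly `size` nodes and depth at most `depth`.
        if size < 1 or depth < 0:
            return 0
        if size == 1:
            return n_leaves
        if depth == 0:
            return 0
        # Root is a unary operator over one subtree with size - 1 nodes.
        total = n_unary * exact(size - 1, depth - 1)
        # Root is a binary operator; split the remaining nodes between children.
        for left in range(1, size - 1):
            right = size - 1 - left
            total += n_binary * exact(left, depth - 1) * exact(right, depth - 1)
        return total

    return sum(exact(s, max_depth) for s in range(1, max_size + 1))


if __name__ == "__main__":
    # Compare a loose depth limit with a tight one at several size limits.
    for size in (5, 10, 20, 40):
        loose = count_structures(size, max_depth=size)
        tight = count_structures(size, max_depth=4)
        print(f"size<={size:2d}  log|structures| loose={math.log(loose):7.2f}  "
              f"depth<=4: {math.log(tight):7.2f}")
```

Under these assumptions, the log-count grows roughly linearly with the size limit when depth is unconstrained and saturates once a depth limit binds. This is one simple way to see why parsimony pressure and depth limits shrink a structure-selection term. The constant-fitting term is not modeled here.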