Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite the strong empirical success of GP-based SR, theoretical understanding of why it generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth, while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.
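To make the structure-selection idea concrete, the following minimal sketch counts labeled expression-tree structures up to a size limit and plugs the count into the textbook finite-class (Hoeffding plus union bound) term for losses in [0, 1]. It is an illustration only, not the bound derived in this work: the function names, the operator and terminal counts, and the choice to ignore the depth limit and to treat every constant as a single placeholder terminal are all assumptions made here.

```python
import math


def count_structures(max_size, n_binary=4, n_unary=2, n_terminals=3):
    """Count labeled expression trees with at most max_size nodes.

    Hypothetical setup: n_binary binary operators, n_unary unary operators,
    and n_terminals leaf symbols (variables plus one constant placeholder).
    Depth limits are ignored in this sketch.
    """
    exact = [0] * (max_size + 1)  # exact[n] = number of trees with exactly n nodes
    if max_size >= 1:
        exact[1] = n_terminals  # a single leaf
    for n in range(2, max_size + 1):
        # Root is a unary operator over a subtree with n - 1 nodes.
        total = n_unary * exact[n - 1]
        # Root is a binary operator; split the remaining n - 1 nodes left/right.
        for left in range(1, n - 1):
            total += n_binary * exact[left] * exact[n - 1 - left]
        exact[n] = total
    return sum(exact)


def finite_class_term(max_size, n_samples, delta=0.05, **alphabet):
    """Textbook finite-class gap term: sqrt((ln|S| + ln(1/delta)) / (2 m))."""
    n_structures = count_structures(max_size, **alphabet)
    return math.sqrt(
        (math.log(n_structures) + math.log(1.0 / delta)) / (2.0 * n_samples)
    )


if __name__ == "__main__":
    for size in (5, 10, 20, 40):
        term = finite_class_term(size, n_samples=1000)
        print(f"max_size={size:3d}  structures={count_structures(size):.3e}  term={term:.3f}")
```

In this sketch the number of structures grows roughly exponentially with the size limit, so its logarithm grows roughly linearly. This is one simple way to see why a size restriction such as parsimony pressure would shrink a structure-selection term, under the stated assumptions.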