Current Challenges of Symbolic Regression: Optimization, Selection, Model Simplification, and Benchmarking

Symbolic Regression (SR) is a regression method that aims to discover mathematical expressions describing the relationship between variables. It is often implemented through Genetic Programming, a search technique inspired by biological evolution. Its appeal lies in combining predictive accuracy with interpretable models, but its promise is limited by several long-standing challenges: parameters are difficult to optimize, the way solutions are selected can affect the search, and models often grow unnecessarily complex. In addition, current methods must be re-evaluated regularly to understand the SR landscape. This thesis addresses these challenges through a sequence of studies conducted throughout the doctorate, each focusing on an important aspect of the SR search process. First, I investigate parameter optimization, obtaining insights into its role in improving predictive accuracy, albeit with trade-offs in runtime and expression size. Next, I study parent selection, exploring $ε$-lexicase selection to choose parents that are more likely to generate well-performing offspring. The focus then turns to simplification, where I introduce a novel method based on memoization and locality-sensitive hashing that reduces redundancy and yields simpler, more accurate models. All of these contributions are implemented in a multi-objective evolutionary SR library, which achieves Pareto-optimal performance in terms of accuracy and simplicity on benchmarks of real-world and synthetic problems, outperforming several contemporary SR approaches. The thesis concludes by proposing changes to a widely used large-scale symbolic regression benchmark suite and then running the experiments to assess the symbolic regression landscape, demonstrating that an SR method with the contributions presented in this thesis achieves Pareto-optimal performance.
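For readers unfamiliar with $ε$-lexicase selection, the following is a minimal sketch of one common variant, not necessarily the exact procedure used in the thesis. It assumes each candidate has a list of per-case absolute errors, and that $ε$ for each case is the median absolute deviation of the population's errors on that case. The function name `epsilon_lexicase_select` and the data layout are illustrative assumptions.

```python
import random
from statistics import median


def epsilon_lexicase_select(errors, rng):
    """Pick one parent index using an epsilon-lexicase sketch.

    errors: list of rows, one per individual; errors[i][c] is the absolute
            error of individual i on training case c (assumed layout).
    rng:    a random.Random instance used for shuffling and tie-breaking.
    """
    n_cases = len(errors[0])

    # Assumption: epsilon per case is the median absolute deviation (MAD)
    # of the whole population's errors on that case.
    eps = []
    for c in range(n_cases):
        column = [row[c] for row in errors]
        med = median(column)
        eps.append(median([abs(e - med) for e in column]))

    # Start with every individual, then filter case by case in random order.
    pool = list(range(len(errors)))
    cases = list(range(n_cases))
    rng.shuffle(cases)
    for c in cases:
        best = min(errors[i][c] for i in pool)
        # Keep individuals within epsilon of the best error in the pool.
        pool = [i for i in pool if errors[i][c] <= best + eps[c]]
        if len(pool) == 1:
            break

    # Any survivors are considered equally good; choose one at random.
    return rng.choice(pool)


if __name__ == "__main__":
    population_errors = [[0.0, 5.0], [5.0, 0.0], [5.0, 5.0]]
    parent = epsilon_lexicase_select(population_errors, random.Random(0))
    print("selected parent index:", parent)
```

Because the cases are visited in a random order, an individual that is best on only some cases can still be selected, which is what distinguishes this family of methods from selection on an aggregate error.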
