We analyse the search behaviour of genetic programming for symbolic regression in settings that are practically relevant yet restricted enough to allow exhaustive enumeration of all solutions. This lets us quantify the probability of finding the best possible expressions and compare the search efficiency of genetic programming with that of random search in the space of semantically unique expressions. The analysis is made possible by improved algorithms for equality saturation, which we use to enhance the Exhaustive Symbolic Regression algorithm; the result is the set of semantically unique expression structures, which is orders of magnitude smaller than the full symbolic regression search space. For our experiments we use two real-world datasets for which symbolic regression has produced well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that, in such limited settings, genetic programming explores only a small fraction of all unique expressions and repeatedly evaluates expressions that are congruent to ones it has already visited.
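As an illustration of how a success probability for random search can be quantified once the set of unique expressions is fully enumerated, the following minimal sketch assumes uniform sampling without replacement. The function name and parameters are hypothetical and are not taken from the paper; the formula is the standard hypergeometric "at least one hit" probability, which may differ from the exact baseline the authors use.

```python
from math import comb


def random_search_success_probability(total_unique, num_targets, num_evaluations):
    """Probability that uniform random search without replacement evaluates
    at least one of `num_targets` best expressions among `total_unique`
    semantically unique expressions within `num_evaluations` evaluations.

    All names are illustrative assumptions, not the paper's notation.
    """
    if total_unique <= 0:
        raise ValueError("total_unique must be positive")
    if not 0 <= num_targets <= total_unique:
        raise ValueError("num_targets must lie in [0, total_unique]")
    if num_evaluations < 0:
        raise ValueError("num_evaluations must be non-negative")

    # The search cannot evaluate more distinct expressions than exist.
    n = min(num_evaluations, total_unique)

    # P(no hit) = C(N - k, n) / C(N, n); math.comb returns 0 when n > N - k.
    p_miss = comb(total_unique - num_targets, n) / comb(total_unique, n)
    return 1.0 - p_miss
```

For instance, with 10 unique expressions, a single best expression, and 3 evaluations, the sketch gives 1 - C(9,3)/C(10,3) = 1 - 84/120 = 0.3. A genetic programming run with the same evaluation budget could be compared against this baseline by estimating its empirical success rate over repeated runs.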