We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings that allow exhaustive enumeration of all solutions. This enables us to quantify the probability of finding the best possible expressions, and to compare the search efficiency of genetic programming with that of random search in the space of semantically unique expressions. The analysis is made possible by improved algorithms for equality saturation, which we use to improve the Exhaustive Symbolic Regression algorithm; the improved algorithm produces the set of semantically unique expression structures, which is orders of magnitude smaller than the full symbolic regression search space. For our experiments we use two real-world datasets for which symbolic regression has been used to produce well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that genetic programming in such limited settings explores only a small fraction of all unique expressions, and repeatedly evaluates expressions that are congruent to ones it has already visited.