When searching for symbolic regression models, genetic programming (GP) tends to revisit expressions, either in their original form or in an equivalent one. Repeatedly evaluating equivalent expressions is inefficient, as it does not immediately lead to better solutions. However, evolutionary algorithms require diversity and should allow the accumulation of inactive building blocks that can play an important role at a later point. The equality graph (e-graph) is a data structure that compactly stores expressions together with their equivalent forms, making it efficient to check whether an expression has already been visited in any of its stored equivalent forms. We exploit the e-graph to adapt the subtree operators so that they are less likely to revisit expressions. Our adaptation, called eggp, stores every visited expression in the e-graph, which allows us to filter out, from the available selection of subtrees, all combinations that would create already visited expressions. Results show that, for small expressions, this approach improves the performance of a simple GP algorithm enough to compete with PySR and Operon without increasing computational cost. As a highlight, eggp reliably delivered models that are both short and accurate for a selected set of benchmarks from SRBench and a set of real-world datasets.
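The filtering idea can be illustrated with a minimal Python sketch. It is not the eggp implementation and does not build an e-graph: as a stated simplification, it keeps a plain set of canonicalized expressions and recognizes only one kind of equivalence, the commutativity of `+` and `*`. The names `canonical`, `VisitedStore`, `replace_right`, and `novel_candidates` are hypothetical and chosen only for this illustration.

```python
# Expressions are leaves (variable names or constants) or binary nodes
# written as tuples: (op, left, right).
COMMUTATIVE = {"+", "*"}  # assumption: the only equivalences handled here


def canonical(expr):
    """Return a canonical form so that, e.g., x + y and y + x compare equal."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr[0], canonical(expr[1]), canonical(expr[2])
    # Order the operands of commutative operators deterministically.
    if op in COMMUTATIVE and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)


class VisitedStore:
    """Simplified stand-in for the e-graph: a set of canonical forms."""

    def __init__(self):
        self._seen = set()

    def add(self, expr):
        self._seen.add(canonical(expr))

    def __contains__(self, expr):
        return canonical(expr) in self._seen


def replace_right(parent, subtree):
    """Build the offspring obtained by swapping in a new right subtree."""
    op, left, _ = parent
    return (op, left, subtree)


def novel_candidates(parent, candidates, visited):
    """Keep only the subtrees whose offspring has not been visited yet."""
    return [s for s in candidates if replace_right(parent, s) not in visited]


visited = VisitedStore()
visited.add(("+", "x", "y"))
visited.add(("+", "y", "z"))  # the parent itself has been evaluated

parent = ("+", "y", "z")
candidates = ["x", "z", ("*", "x", "y")]
novel = novel_candidates(parent, candidates, visited)
print(novel)  # [('*', 'x', 'y')]
```

Here the candidate `x` is rejected because `y + x` is equivalent to the stored `x + y`, and `z` is rejected because it reproduces the parent. An actual e-graph would cover far more equivalences than this sorting trick, since it stores whole equivalence classes produced by rewrite rules.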