Symbolic regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for greater human understanding of the underlying structure than black-box approaches. Adding background knowledge, in the form of symbolic mathematical constraints, allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from historical experimental datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order of magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search for expressions. We find that incorporating these constraints in Bayesian SR (as the Bayesian prior) works better than modifying the fitness function in the GA.
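To make the idea of a soft constraint concrete, the following is a minimal sketch of a penalized fitness function, not the implementation used with PySR or the Bayesian Machine Scientist. The two constraints (zero loading at zero pressure, loading non-decreasing with pressure), the numerical checks on a pressure grid, the Langmuir parameters, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def langmuir(p, q_max=1.0, k=2.0):
    """Langmuir isotherm: adsorbed loading as a function of pressure."""
    return q_max * k * p / (1.0 + k * p)

def constraint_violations(f, p_grid, tol=1e-9):
    """Count violated constraints for candidate f (illustrative, numerical checks)."""
    q = f(p_grid)
    violations = 0
    # Assumed constraint 1: no adsorption at zero pressure.
    if abs(f(np.array([0.0]))[0]) > 1e-6:
        violations += 1
    # Assumed constraint 2: loading never decreases as pressure increases.
    if np.any(np.diff(q) < -tol):
        violations += 1
    return violations

def soft_constrained_loss(f, p, q_obs, p_grid, penalty=1.0):
    """Mean squared error plus a penalty per violated constraint."""
    mse = float(np.mean((f(p) - q_obs) ** 2))
    return mse + penalty * constraint_violations(f, p_grid)

p = np.linspace(0.0, 5.0, 50)
q_obs = langmuir(p)

def hump(x):
    # Starts at zero but decreases for x > 1, so it breaks monotonicity.
    return x * np.exp(-x)

print(soft_constrained_loss(langmuir, p, q_obs, p))
print(soft_constrained_loss(hump, p, q_obs, p))
```

Under this sketch, a hard constraint would correspond to rejecting any candidate with a nonzero violation count outright, whereas the soft version only worsens its score. In a Bayesian setting, a comparable penalty could be added to the negative log-prior rather than to the data-fit term.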