Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery

arXiv preprint. Code and datasets are available at:
https://github.com/omron-sinicx/srsd-benchmark
https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_easy
https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_medium
https://huggingface.co/datasets/yoshitomo-matsubara/srsd-feynman_hard
Another three sets of SRSD datasets with dummy variables are also available (see Appendix).

This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), focusing on its potential for scientific discovery. Starting from the set of formulas used in existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values, so that the new SRSD datasets can be used to evaluate the potential of SRSD, such as whether an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can select only the necessary variables. In addition, we propose using the normalized edit distance (NED) between the predicted and true equation trees to address a critical issue: existing SR metrics are either binary or measure only the error between the target values and an SR model's predicted values for a given input. We conduct experiments on our new SRSD datasets using six SR methods. The experimental results show that our benchmark provides a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric does.

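To make the proposed metric concrete, here is a minimal sketch of a normalized edit distance between two equation trees. The abstract only states that NED is computed between the predicted and true equation trees. The details below are assumptions for illustration and may differ from the paper's implementation: unit costs for insert, delete, and relabel; ordered trees; normalization by the node count of the true tree; clipping at 1. The names `node`, `forest_size`, `forest_distance`, and `normalized_edit_distance` are hypothetical and not taken from the srsd-benchmark code.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable tree node: (label, (child, child, ...))."""
    return (label, tuple(children))


def forest_size(forest):
    """Count all nodes in a tuple of trees."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (assumed cost model)."""
    if not f1:
        return forest_size(f2)  # insert every node of f2
    if not f2:
        return forest_size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    relabel = (forest_distance(f1[:-1], f2[:-1])
               + forest_distance(kids1, kids2)
               + (label1 != label2))
    return min(delete, insert, relabel)


def normalized_edit_distance(pred, true):
    """Edit distance divided by the true tree's size, clipped to [0, 1] (assumption)."""
    distance = forest_distance((pred,), (true,))
    return min(1.0, distance / forest_size((true,)))


# Toy example: the true law is m * a.
true_eq = node("mul", node("m"), node("a"))
wrong_op = node("add", node("m"), node("a"))  # one operator is wrong
print(normalized_edit_distance(true_eq, true_eq))
print(normalized_edit_distance(wrong_op, true_eq))
```

Unlike a binary "exact match" score, a measure of this kind gives partial credit to a prediction that is structurally close to the true equation, which is the motivation the abstract gives for NED. The memoized recursion is adequate for small equation trees; a dedicated tree edit distance algorithm would scale better to large expressions.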