Symbolic Regression (SR) algorithms attempt to learn analytic expressions that fit data accurately while remaining highly interpretable. Conventional SR suffers from two fundamental issues, which we address here. First, these methods search the space of functions stochastically (typically using genetic programming) and hence do not necessarily find the best function. Second, the criteria used to select the equation that optimally balances accuracy with simplicity have been variable and subjective. To address these issues we introduce Exhaustive Symbolic Regression (ESR), which systematically and efficiently considers all possible equations -- made with a given basis set of operators and up to a specified maximum complexity -- and is therefore guaranteed to find the true optimum (if parameters are perfectly optimised) and a complete function ranking subject to these constraints. We implement the minimum description length principle as a rigorous method for combining the preferences for accuracy and simplicity into a single objective. To illustrate the power of ESR we apply it to a catalogue of cosmic chronometers and the Pantheon+ sample of supernovae to learn the Hubble rate as a function of redshift, finding $\sim$40 functions (out of 5.2 million trial functions) that fit the data more economically than the Friedmann equation. These low-redshift data therefore do not uniquely prefer the expansion history of the standard model of cosmology. We make our code and full equation sets publicly available.
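The exhaustive search and ranking described in the abstract can be illustrated with a toy sketch. This is not the authors' released code. The basis set (`LEAVES`, `UNARY`, `BINARY`), the function names, and the scoring are all assumptions made for illustration. In particular, the description length used here is a simplified stand-in: a Gaussian negative log-likelihood (up to an additive constant) plus a structural term k log n, where k is the number of nodes and n the number of distinct symbols in the expression. The sketch has no free parameters to optimise and does not merge mathematically equivalent expressions.

```python
import math

# Hypothetical toy basis set, chosen for illustration only.
LEAVES = ["x", "1", "2"]
UNARY = {"sqrt": math.sqrt, "inv": lambda a: 1.0 / a}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def enumerate_trees(complexity):
    """Yield every expression tree with exactly `complexity` nodes."""
    if complexity == 1:
        for leaf in LEAVES:
            yield (leaf,)
        return
    for op in UNARY:
        for child in enumerate_trees(complexity - 1):
            yield (op, child)
    for op in BINARY:
        # Split the remaining nodes between the two children.
        for left_size in range(1, complexity - 1):
            right_size = complexity - 1 - left_size
            for left in enumerate_trees(left_size):
                for right in enumerate_trees(right_size):
                    yield (op, left, right)


def evaluate(tree, x):
    """Evaluate an expression tree at the scalar point x."""
    head = tree[0]
    if len(tree) == 1:
        return x if head == "x" else float(head)
    if len(tree) == 2:
        return UNARY[head](evaluate(tree[1], x))
    return BINARY[head](evaluate(tree[1], x), evaluate(tree[2], x))


def symbols(tree):
    """Return the list of all node symbols in the tree."""
    out = [tree[0]]
    for child in tree[1:]:
        out.extend(symbols(child))
    return out


def description_length(tree, xs, ys, sigma):
    """Simplified codelength: accuracy term plus structural term (assumed form)."""
    residuals = [(y - evaluate(tree, x)) / sigma for x, y in zip(xs, ys)]
    neg_log_like = 0.5 * sum(r * r for r in residuals)
    nodes = symbols(tree)
    return neg_log_like + len(nodes) * math.log(len(set(nodes)))


def rank_functions(xs, ys, sigma, max_complexity):
    """Score every tree up to max_complexity and return them best-first."""
    ranking = []
    for complexity in range(1, max_complexity + 1):
        for tree in enumerate_trees(complexity):
            try:
                ranking.append((description_length(tree, xs, ys, sigma), tree))
            except (ValueError, ZeroDivisionError, OverflowError):
                continue  # skip expressions undefined on the data
    ranking.sort(key=lambda item: item[0])
    return ranking


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [x * x + 1.0 for x in xs]  # noise-free toy data
    for dl, tree in rank_functions(xs, ys, sigma=0.1, max_complexity=5)[:3]:
        print(round(dl, 3), tree)
```

Because every tree up to the complexity limit is visited, the top of the ranking is the true optimum within this toy function space, which is the property the abstract contrasts with stochastic search. The cost is that the number of trees grows rapidly with complexity.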