Symbolic regression plays a crucial role in modern scientific research thanks to its ability to discover concise, interpretable mathematical expressions from data. A grand challenge lies in the arduous search, over an infinite search space, for parsimonious and generalizable mathematical formulas that also fit the training data. For more than a decade, existing algorithms have faced a critical bottleneck in accuracy and efficiency when handling complex problems, which has slowed the application of symbolic regression to scientific exploration across disciplines. To this end, we introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data. Through a series of extensive experiments, we demonstrate the superior accuracy and efficiency of PTS for equation discovery: it greatly outperforms state-of-the-art baseline models on over 80 synthetic and experimental datasets (e.g., improving accuracy by up to 99% and delivering an order-of-magnitude speedup). PTS represents a key advance in accurate and efficient data-driven discovery of symbolic, interpretable models (e.g., underlying physical laws) and marks a pivotal transition towards scalable symbolic learning.