LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Mathematical equations have been unreasonably effective in describing complex natural phenomena across various scientific disciplines. However, discovering such insightful equations from data presents significant challenges due to the necessity of navigating extremely high-dimensional combinatorial and nonlinear hypothesis spaces. Traditional methods of equation discovery, commonly known as symbolic regression, largely focus on extracting equations from data alone, often neglecting the rich domain-specific prior knowledge that scientists typically depend on. To bridge this gap, we introduce LLM-SR, a novel approach that leverages the extensive scientific knowledge and robust code generation capabilities of Large Language Models (LLMs) to discover scientific equations from data in an efficient manner. Specifically, LLM-SR treats equations as programs with mathematical operators and combines LLMs' scientific priors with evolutionary search over equation programs. The LLM iteratively proposes new equation skeleton hypotheses, drawing from its physical understanding, which are then optimized against data to estimate skeleton parameters. We demonstrate LLM-SR's effectiveness across three diverse scientific domains, where it discovers physically accurate equations that provide significantly better fits to in-domain and out-of-domain data compared to the well-established symbolic regression baselines. Incorporating scientific prior knowledge also enables LLM-SR to search the equation space more efficiently than baselines. Code is available at: https://github.com/deep-symbolic-mathematics/LLM-SR
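The propose-then-fit loop described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). Assumptions: the two skeleton functions are hand-written stand-ins for LLM proposals, the target law a = -2.0*x - 0.5*v is synthetic, and plain gradient descent with finite-difference gradients is swapped in as the parameter optimizer. All names (`skeleton_damped`, `fit`, `search`, `CANDIDATES`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# An equation skeleton is a program: inputs (x, v) plus free parameters.
Skeleton = Callable[[float, float, Sequence[float]], float]
Sample = Tuple[float, float, float]  # (x, v, observed acceleration)


def skeleton_spring_only(x: float, v: float, params: Sequence[float]) -> float:
    # Hypothetical proposal: restoring force only.
    return params[0] * x


def skeleton_damped(x: float, v: float, params: Sequence[float]) -> float:
    # Hypothetical proposal: restoring force plus linear damping.
    return params[0] * x + params[1] * v


def mse(skeleton: Skeleton, params: Sequence[float], data: List[Sample]) -> float:
    # Mean squared error of the skeleton's predictions on the data.
    return sum((skeleton(x, v, params) - a) ** 2 for x, v, a in data) / len(data)


def fit(skeleton: Skeleton, n_params: int, data: List[Sample],
        lr: float = 0.5, steps: int = 200, eps: float = 1e-6):
    # Estimate skeleton parameters by gradient descent on the MSE,
    # using central finite differences for the gradient.
    params = [0.0] * n_params
    for _ in range(steps):
        grad = []
        for i in range(n_params):
            up, down = list(params), list(params)
            up[i] += eps
            down[i] -= eps
            grad.append((mse(skeleton, up, data) - mse(skeleton, down, data)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, mse(skeleton, params, data)


def make_data() -> List[Sample]:
    # Synthetic observations from an assumed ground truth a = -2.0*x - 0.5*v.
    grid = [i / 5 - 1 for i in range(11)]
    return [(x, v, -2.0 * x - 0.5 * v) for x in grid for v in grid]


def search(candidates, data: List[Sample]):
    # Fit every candidate skeleton and keep the one with the lowest error.
    results = [(name, *fit(fn, n, data)) for name, fn, n in candidates]
    return min(results, key=lambda r: r[2])


# (name, skeleton, number of free parameters)
CANDIDATES = [
    ("spring_only", skeleton_spring_only, 1),
    ("damped", skeleton_damped, 2),
]
```

Calling `search(CANDIDATES, make_data())` returns the name, fitted parameters, and error of the best-scoring skeleton. In the method described above, the candidate list would instead be produced iteratively by the LLM, with well-scoring skeletons fed back into later prompts as part of the evolutionary search.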
