Mathematical equations have been unreasonably effective in describing complex natural phenomena across scientific disciplines. However, discovering such insightful equations from data is difficult because it requires navigating extremely high-dimensional combinatorial and nonlinear hypothesis spaces. Traditional equation discovery methods largely focus on extracting equations from data alone, often neglecting the rich domain-specific prior knowledge that scientists typically depend on. To bridge this gap, we introduce LLM-SR, a novel approach that leverages the extensive scientific knowledge and robust code generation capabilities of Large Language Models (LLMs) to discover scientific equations from data efficiently. Specifically, LLM-SR treats equations as programs with mathematical operators and combines LLMs' scientific priors with evolutionary search over equation programs. The LLM iteratively proposes new equation skeletons, drawing on its physical understanding, and the parameters of each skeleton are then optimized against data. We demonstrate LLM-SR's effectiveness across three diverse scientific domains, where it discovers physically accurate equations that fit in-domain and out-of-domain data significantly better than well-established equation discovery baselines.
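The idea of treating an equation as a program whose skeleton is proposed first and whose parameters are then fitted to data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the LLM-SR implementation: the two candidate skeletons are written by hand where an LLM would propose them, the names `damped_spring`, `spring_only`, `mse`, and `fit_params` are hypothetical, the data are synthetic, and a simple coordinate pattern search stands in for whatever numerical optimizer is actually used. The evolutionary search over skeletons is omitted.

```python
# Illustrative sketch only: skeletons are hand-written stand-ins for LLM proposals.

def spring_only(x, v, p):
    """Hypothetical skeleton: acceleration depends on position only."""
    return -p[0] * x

def damped_spring(x, v, p):
    """Hypothetical skeleton: linear restoring force plus linear damping."""
    return -p[0] * x - p[1] * v

def mse(skeleton, params, data):
    """Mean squared error of a skeleton with given parameters on (x, v, a) rows."""
    return sum((skeleton(x, v, params) - a) ** 2 for x, v, a in data) / len(data)

def fit_params(skeleton, data, n_params, step=1.0, tol=1e-8, max_iter=10_000):
    """Coordinate pattern search (an assumed stand-in for a numerical optimizer)."""
    params = [1.0] * n_params
    best = mse(skeleton, params, data)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(n_params):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                score = mse(skeleton, trial, data)
                if score < best:
                    params, best, improved = trial, score, True
        if not improved:
            step /= 2  # shrink the search step once no neighbor improves
    return params, best

# Synthetic observations generated from a = -2.0 * x - 0.5 * v.
data = [(i / 4, j / 4, -2.0 * (i / 4) - 0.5 * (j / 4))
        for i in range(-4, 5) for j in range(-4, 5)]

# Each candidate is (skeleton function, number of free parameters).
candidates = {"spring_only": (spring_only, 1), "damped_spring": (damped_spring, 2)}

# Fit every skeleton, then keep the one with the lowest error after fitting.
results = {name: fit_params(f, data, n) for name, (f, n) in candidates.items()}
best_name = min(results, key=lambda name: results[name][1])
best_params, best_error = results[best_name]
```

In this toy setting the skeleton that includes the damping term fits the synthetic data better than the position-only skeleton, which is the kind of comparison a search over skeletons would rely on when deciding which candidates to keep and refine.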