Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
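The two-stage reasoning described above (infer a symmetry first, then use it as a prior to shrink the candidate space) can be illustrated with a minimal sketch. This is not KeplerAgent's implementation and it calls neither PySINDy nor PySR: a plain thresholded least-squares fit over monomials stands in for the symbolic regression engine, and every function name, the synthetic target y = 2x^3 - x, the noise level, and the thresholds are assumptions made for illustration only.

```python
import random

def detect_parity(xs, ys, tol=1e-2):
    """Classify data as 'odd', 'even', or 'none' by comparing y(x) with y(-x)."""
    lookup = {round(x, 9): y for x, y in zip(xs, ys)}
    pairs = [(y, lookup[round(-x, 9)])
             for x, y in zip(xs, ys)
             if x > 0 and round(-x, 9) in lookup]
    if not pairs:
        return "none"
    scale = max(abs(y) for y in ys) or 1.0
    if all(abs(a + b) <= tol * scale for a, b in pairs):
        return "odd"
    if all(abs(a - b) <= tol * scale for a, b in pairs):
        return "even"
    return "none"

def restricted_powers(parity, max_degree):
    """Use the detected symmetry as a prior: keep only compatible monomials."""
    powers = range(max_degree + 1)
    if parity == "odd":
        return [p for p in powers if p % 2 == 1]
    if parity == "even":
        return [p for p in powers if p % 2 == 0]
    return list(powers)

def solve_linear(a, b):
    """Solve a small dense system with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_sparse(xs, ys, powers, threshold=0.05):
    """Least-squares fit over the chosen monomials, dropping small coefficients."""
    feats = [[x ** p for p in powers] for x in xs]
    k = len(powers)
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(k)] for i in range(k)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    coefs = solve_linear(gram, rhs)
    return {p: c for p, c in zip(powers, coefs) if abs(c) >= threshold}

# Synthetic, slightly noisy samples of the assumed target y = 2x^3 - x.
rng = random.Random(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x ** 3 - x + rng.gauss(0, 0.01) for x in xs]

parity = detect_parity(xs, ys)                    # step 1: infer structure
powers = restricted_powers(parity, max_degree=5)  # step 2: constrain the library
model = fit_sparse(xs, ys, powers)                # step 3: fit within that library
print(parity, powers, model)
```

In this toy setting the symmetry check halves the library before any fitting happens, which is the kind of search-space restriction the abstract attributes to the intermediate physical structure; a real engine would receive the same information through its own function-library and constraint settings.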