Uncovering hidden symbolic laws from time series data, an aspiration dating back to Kepler's discovery of the laws of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models (LLMs) show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms of varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, in which LLMs act as both predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery. Project repository: https://github.com/nuuuh/SymbolBench