Large language models (LLMs) are increasingly used in automated decision-making systems in fields such as finance, healthcare, and environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, and combining the two. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best-performing agent still answers about 22-34% of the questions incorrectly. To gain insight into the models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.
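As a rough illustration of the difference between the raw-data and coding agent setups, the following minimal sketch may help. It is not the implementation evaluated here: the names `format_raw`, `run_snippet`, `agent_loop`, and `scripted_policy` are hypothetical, and a scripted policy stands in for the LLM that would normally write each snippet.

```python
import contextlib
import io


def format_raw(series, precision=2):
    """Raw-data mode: render the series as text to place in the prompt."""
    return ", ".join(f"{x:.{precision}f}" for x in series)


def run_snippet(series, code):
    """Coding-agent mode: run one snippet against the data.

    The snippet sees the data as `series`; whatever it prints is returned
    so it can be handed back to the model as the next observation.
    """
    buffer = io.StringIO()
    namespace = {"series": list(series)}
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # illustration only: no sandboxing here
    return buffer.getvalue().strip()


def agent_loop(series, propose, max_turns=5):
    """Ask `propose` for code, run it, and feed the output back, repeatedly.

    `propose` is a hypothetical stand-in for the LLM: it receives the list
    of observations so far and returns a code string, or None to stop.
    """
    observations = []
    for _ in range(max_turns):
        code = propose(observations)
        if code is None:
            break
        observations.append(run_snippet(series, code))
    return observations


def scripted_policy(observations):
    """Assumed stand-in for a model: a fixed two-step query plan."""
    steps = ["print(len(series))", "print(max(series) - min(series))"]
    if len(observations) < len(steps):
        return steps[len(observations)]
    return None


series = [1.0, 3.0, 2.0, 5.0]
print(format_raw(series))
print(agent_loop(series, scripted_policy))
```

In the combined setup, the text produced by `format_raw` would be placed in the prompt alongside access to a tool like `run_snippet`. A real system would also need to sandbox the executed code and handle exceptions raised by the snippets.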