We introduce LongDA, a data analysis benchmark for evaluating LLM-based agents in documentation-intensive analytical workflows. In contrast to existing benchmarks that assume well-specified schemas and inputs, LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. To this end, we manually curate raw data files, long and heterogeneous documentation, and expert-written publications from 17 publicly available U.S. national surveys, from which we extract 505 analytical queries grounded in real analytical practice. Solving these queries requires agents to first retrieve and integrate key information from multiple unstructured documents before performing multi-step computations and writing executable code, which remains challenging for existing data analysis agents. To support systematic evaluation in this setting, we develop LongTA, a tool-augmented agent framework that provides document access, retrieval, and code execution, and we use it to evaluate a range of proprietary and open-source models. Our experiments reveal substantial performance gaps even among state-of-the-art models, highlighting the challenges researchers should consider before deploying LLM agents for decision support in real-world, high-stakes analytical settings.