Quantitative reasoning is a critical skill for analyzing data, yet assessment of this ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, which evaluates Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and on text, we enrich the benchmark with an auxiliary set of 290 text-only questions, named QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods, including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants, on diverse models. The strongest model, GPT-4, achieves an accuracy of 58%, leaving substantial room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, achieves the highest accuracy at 37%. Our analysis reveals that models have difficulty with data analysis and causal reasoning, and struggle to use causal knowledge and the provided data simultaneously. Code and data are available at https://github.com/xxxiaol/QRData.