The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. Existing benchmarks have two major gaps: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tool use, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Fine-tuning on DARE-bench training tasks can substantially improve model performance: for example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x, and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These substantial improvements confirm the value of DARE-bench both as an accurate evaluation benchmark and as a source of critical training data.