In the era of data-driven decision-making, data analysis is complex enough to require advanced data science expertise and tools, and it poses significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising data science agents that assist humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and by complicated analytical processes. In this paper, we introduce DSEval, a novel evaluation paradigm, together with a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. By incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.