Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on its constituent parts. We test whether AI models can generate end-to-end data science projects. To do this, we create a benchmark of 40 end-to-end data science projects with associated rubric evaluations, and use it to build an automated grading pipeline that systematically evaluates the projects produced by generative AI models. We find that the extent to which generative AI models can complete end-to-end data science projects varies considerably by model. The most recent models performed well on structured tasks, but differed considerably on tasks requiring judgment. These findings suggest that while AI models can approximate entry-level data scientists on routine tasks, their output still requires human verification.