Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results reveal three key findings: (1) LLMs frequently adopt provided information uncritically, significantly impairing their predictive performance when adversarial content is introduced; (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information; and (3) on Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems. Our data and code are publicly available at: https://github.com/jeremyxianx/Assisted-DS