Advances in large language models (LLMs) have sparked growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, generating both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims about end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task into a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts per task, the best-performing agent can solve only 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacity of current language agents to generate code for data-driven discovery, let alone to automate scientific research end-to-end.