Understanding the world through models is a fundamental goal of scientific research. While large language model (LLM) based approaches show promise in automating scientific discovery, they often overlook the importance of criticizing scientific models. Criticizing models deepens scientific understanding and drives the development of more accurate models. Automating model criticism is difficult because it traditionally requires a human expert to define how to compare a model with data and to evaluate whether the discrepancies are significant; both steps rely heavily on understanding the modeling assumptions and the domain. Although LLM-based critics are appealing, they introduce a new challenge: LLMs might hallucinate the critiques themselves. Motivated by this, we introduce CriticAL (Critic Automation with Language Models). CriticAL uses LLMs to generate summary statistics that capture discrepancies between model predictions and data, and applies hypothesis tests to evaluate their significance. We can view CriticAL as a verifier that validates models and their critiques by embedding them in a hypothesis testing framework. In experiments, we evaluate CriticAL across key quantitative and qualitative dimensions. In settings where we synthesize discrepancies between models and datasets, CriticAL reliably generates correct critiques without hallucinating incorrect ones. We show that both human and LLM judges consistently prefer CriticAL's critiques over alternative approaches in terms of transparency and actionability. Finally, we show that CriticAL's critiques enable an LLM scientist to improve upon human-designed models on real-world datasets.
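To make the idea of pairing a summary statistic with a hypothesis test concrete, the following is a minimal sketch of a simulation-based check, not CriticAL's actual implementation. Everything in it is an assumption made for illustration: the synthetic data, the Gaussian model under criticism, the choice of sample skewness as the statistic (standing in for one an LLM might propose), and the helper names `skewness`, `empirical_p_value`, and `simulate`.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: a hypothetical stand-in for an LLM-proposed statistic."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


def empirical_p_value(observed, simulate, statistic, n_sims=1000):
    """Share of model-simulated datasets whose statistic is at least as
    extreme (in absolute value) as the one computed on the observed data."""
    t_obs = abs(statistic(observed))
    hits = sum(abs(statistic(simulate())) >= t_obs for _ in range(n_sims))
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sims + 1)


rng = random.Random(0)

# Synthetic "observed" data that is deliberately skewed (exponential draws).
observed = [rng.expovariate(1.0) for _ in range(200)]

# Hypothetical model under criticism: a Gaussian fitted by moment matching.
mu, sigma = statistics.fmean(observed), statistics.pstdev(observed)


def simulate():
    """Draw one replicated dataset of the same size from the fitted model."""
    return [rng.gauss(mu, sigma) for _ in observed]


# The critique "the model misses the skew in the data" becomes a test.
p_skewed = empirical_p_value(observed, simulate, skewness)

# Control: a dataset that really comes from the model, for comparison.
control = simulate()
p_control = empirical_p_value(control, simulate, skewness)

print(f"skewed data:  p = {p_skewed:.4f}")
print(f"control data: p = {p_control:.4f}")
```

A small p-value on the skewed data indicates that the fitted model rarely reproduces the observed skewness, so the critique is backed by a test and not only by the critic's assertion. The control run shows how the same statistic behaves on data that the model did generate.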