CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab assesses both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This gap motivates our comparison of interaction strategies. Mixed observation--intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% task accuracy and 0.80 all-edge $F_1$. Yet even strong agents struggle to design informative interventions: pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data helps mitigate it. CausaLab therefore separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.
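To make the distinction between observation, intervention, and structural scoring concrete, the following is a minimal Python sketch, not CausaLab's actual implementation. It assumes linear structural equations with Gaussian noise and a standard precision/recall definition of edge $F_1$ over directed edges; the abstract specifies neither the SCM family nor the exact metric, and the names `sample_linear_scm`, `simulate`, and `edge_f1` are hypothetical.

```python
import random


def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random DAG with linear weights (an assumed SCM family).

    Nodes 0..n_nodes-1 are in topological order, so edges only run i -> j
    with i < j. Returns a dict mapping (parent, child) to an edge weight.
    """
    weights = {}
    for child in range(n_nodes):
        for parent in range(child):
            if rng.random() < edge_prob:
                weights[(parent, child)] = rng.choice([-2.0, -1.0, 1.0, 2.0])
    return weights


def simulate(weights, n_nodes, rng, do=None):
    """Draw one sample from the SCM.

    `do` maps a node to a forced value. An intervened node ignores its
    parents, which is what separates an intervention from an observation.
    """
    do = do or {}
    values = []
    for node in range(n_nodes):
        if node in do:
            values.append(do[node])
            continue
        parent_sum = sum(
            w * values[parent]
            for (parent, child), w in weights.items()
            if child == node
        )
        values.append(parent_sum + rng.gauss(0.0, 0.1))
    return values


def edge_f1(true_edges, pred_edges):
    """F1 over directed edges: harmonic mean of precision and recall.

    Returns 0.0 when no predicted edge is correct (including empty inputs).
    """
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    true_positives = len(true_edges & pred_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rng = random.Random(0)
    scm = sample_linear_scm(n_nodes=6, edge_prob=0.4, rng=rng)
    print("observation: ", simulate(scm, 6, rng))
    print("intervention:", simulate(scm, 6, rng, do={0: 3.0}))
    # A hypothesis with one reversed edge is penalised twice:
    # the true edge is missed and the reversed edge is a false positive.
    print("edge F1:", edge_f1({(0, 1), (1, 2)}, {(0, 1), (2, 1)}))
```

Under this definition a hypothesis can predict the target well while scoring a low edge $F_1$, for example when it captures the target's direct parents but misorients edges elsewhere in the graph. That is the kind of gap the abstract reports.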
