AI-Scientist systems that use large language models to automate research risk generating spurious discoveries through uncontrolled multiple testing. We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded DSL (the Research monad) that makes it impossible to test a hypothesis without updating the error budget, and a declarative scaffolding technique that structurally prevents data leakage across the boundary into LLM-generated code. We ground these guarantees in a machine-checked Lean 4 formalization of the LORD++ online FDR control theorem (855 lines, zero sorry), which identifies four sufficient conditions for FDR control. Three are structural conditions -- about information flow, data separation, and test validity -- enforced by the architecture's type system and scaffolding. The fourth is an arithmetic condition: a budget invariant requiring that a wealth process remain non-negative under floating-point computation. We verify this condition over IEEE 754 doubles using SPARK/Ada, whose GNATprove toolchain statically confirms that no rounding sequence can violate the invariant and whose flow analysis independently confirms the predictability condition. The resulting verification chain -- from real-analysis proof to floating-point implementation -- is, to our knowledge, the first for any online FDR control procedure. Monte Carlo simulation (N=2000 hypotheses) and an end-to-end case study confirm that the monadic implementation controls FDR at 1.1% against a 5% target, while a naive approach inflates to 41%.
翻译:利用大型语言模型自动化研究的AI科学家系统,因不受控的多重检验而面临产生伪发现的风险。本文提出一种功能架构,在两个层面强制实施统计严谨性:一是通过Haskell嵌入式领域特定语言(Research单子),使得在不更新误差预算的情况下检验假设成为不可能;二是通过声明式脚手架技术,从结构上防止数据跨越边界泄漏至LLM生成的代码中。我们将这些保证建立在LORD++在线错误发现率控制定理的机器可验证Lean 4形式化基础上(855行代码,零"sorry"标注),该形式化识别了错误发现率控制的四个充分条件。其中三个为结构性条件——涉及信息流、数据分离和检验有效性——由架构的类型系统和脚手架强制执行。第四个是算术条件:要求财富过程在浮点计算下保持非负的预算不变式。我们使用SPARK/Ada验证该条件对IEEE 754双精度浮点数的适用性,其GNATprove工具链静态确认了任何舍入序列都不会违反不变式,其流分析独立确认了可预测性条件。据我们所知,由此形成的验证链条——从实分析证明到浮点实现——是首个针对在线错误发现率控制程序的完整验证。蒙特卡洛模拟(N=2000个假设)和端到端案例研究证实,单子实现将错误发现率控制在1.1%(目标值为5%),而朴素方法则膨胀至41%。