Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.
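The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the CARE implementation: the abstract does not specify the gate's decision rule, so the rule used here (authorize the challenger only when its pre-selection evidence score exceeds the incumbent's by a margin) is an assumption, as are all names (`GatedController`, `AuditRecord`, `nearest_outcome`, `margin`) and the nearest-neighbour evidence function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[float, ...]            # hypothetical encoding of an experimental condition
History = List[Tuple[Candidate, float]]  # (condition, observed outcome) pairs already revealed
Policy = Callable[[Sequence[Candidate], History], Candidate]


@dataclass
class AuditRecord:
    """One gate decision, stored so it can be reviewed later."""
    step: int
    incumbent_pick: Candidate
    challenger_pick: Candidate
    evidence_gap: float
    authorized: bool


@dataclass
class GatedController:
    incumbent: Policy    # non-LLM optimizer: the default action path
    challenger: Policy   # LLM-revised ranking policy (any callable here)
    evidence: Callable[[Candidate, History], float]  # uses past observations only
    margin: float = 0.0  # assumed threshold; not taken from the paper
    log: List[AuditRecord] = field(default_factory=list)

    def select(self, pool: Sequence[Candidate], history: History) -> Candidate:
        inc = self.incumbent(pool, history)
        cha = self.challenger(pool, history)
        # Compare both picks using only evidence available before selection.
        gap = self.evidence(cha, history) - self.evidence(inc, history)
        authorized = cha != inc and gap > self.margin
        self.log.append(AuditRecord(len(self.log), inc, cha, gap, authorized))
        # Fall back to the incumbent unless the gate authorizes the change.
        return cha if authorized else inc


def nearest_outcome(candidate: Candidate, history: History) -> float:
    """Toy evidence score: outcome of the closest condition observed so far."""
    if not history:
        return 0.0
    closest = min(
        history,
        key=lambda h: sum((a - b) ** 2 for a, b in zip(h[0], candidate)),
    )
    return closest[1]


if __name__ == "__main__":
    pool = [(0.0,), (1.0,)]
    history = [((0.1,), 0.2), ((0.9,), 0.8)]
    controller = GatedController(
        incumbent=lambda p, h: p[0],    # stand-in incumbent: first candidate
        challenger=lambda p, h: p[-1],  # stand-in challenger: last candidate
        evidence=nearest_outcome,
        margin=0.1,
    )
    print(controller.select(pool, history))
    print(controller.log[-1])
```

In this sketch the challenger never acts directly: it only proposes, and every proposal, authorized or not, leaves a record in `log`.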