Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code.
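To make "joint satisfaction" concrete, the following is a minimal sketch of how a candidate verifier set could be scored against a labeled development set. It is an illustration under stated assumptions, not AutoPyVerifier's implementation: the verifier functions `has_boxed` and `boxed_nonempty`, the helper `score_verifier_set`, and the toy data are all hypothetical, and the DAG-based search itself is not shown.

```python
import re
from typing import Callable, Sequence

# A verifier is assumed here to be a deterministic function: output text -> pass/fail.
Verifier = Callable[[str], bool]


def joint_satisfaction(output: str, verifiers: Sequence[Verifier]) -> bool:
    """An output is accepted only if every verifier in the set passes."""
    return all(v(output) for v in verifiers)


def f1_score(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Binary F1 of predicted acceptances against target-objective labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_verifier_set(
    verifiers: Sequence[Verifier],
    outputs: Sequence[str],
    labels: Sequence[bool],
) -> float:
    """F1 between the set's joint satisfaction and the labels."""
    predictions = [joint_satisfaction(o, verifiers) for o in outputs]
    return f1_score(predictions, labels)


# Hypothetical verifiers for a math-style task (illustrative only).
def has_boxed(output: str) -> bool:
    return "\\boxed{" in output


def boxed_nonempty(output: str) -> bool:
    return re.search(r"\\boxed\{[^}]+\}", output) is not None


# Toy development set: LLM outputs with correctness labels.
outputs = ["The answer is \\boxed{4}", "I think it is 4", "\\boxed{}", ""]
labels = [True, False, False, False]

weak_score = score_verifier_set([has_boxed], outputs, labels)
strong_score = score_verifier_set([has_boxed, boxed_nonempty], outputs, labels)
print(f"has_boxed only: {weak_score:.3f}")
print(f"has_boxed + boxed_nonempty: {strong_score:.3f}")
```

On this toy data, adding the second check removes the one false acceptance, so the two-verifier set tracks the labels more closely than the single verifier. A search procedure would compare candidate sets by a score of this kind while preferring smaller sets.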