Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is the proof-winning answer is correct 96% of the time at high proved coverage but 20% at low, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
翻译:Lean正越来越多地被用于评判自然语言数学答案的正确性,但其信号存在局限性:许多答案无法形式化,而证明失败可能源于类型错误或缺少库函数,而非答案本身错误。在MATH-500数据集上的实验表明,该信号(i)显著依赖覆盖率——在证明覆盖率较高时,通过证明的答案正确率达96%,但在覆盖率较低时仅为20%;(ii)稀疏且常不可靠:一个70亿参数的自动形式化器仅能对28%的问题完成证明,而人工审核发现其中仅约43%的证明是可靠的。为此,我们提出COVCAL方法——一种基于Lean追踪诊断的选择器,能在有限样本下对已接受答案提供选择性风险保证(要么认证答案,要么拒绝回答)。该方法支持两种机制:保守的Bonferroni界和更严格的开发-校准规则。其可行性取决于自动形式化覆盖率:使用70亿参数形式化器时信号过于稀疏,Bonferroni方法在所有20个自助采样分区上均拒绝回答;而使用专用证明器形式化器时覆盖率提升至79%,在20个分区中的17个上实现可行,以0.98的接受准确率接受了约48%的问题。由于仅靠自一致性方法已达91%的准确率,本研究的贡献在于精确阐明:在风险控制条件下,何时以及使用何种形式化器才能信任部分形式化信号。