How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured \emph{on average} over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train \emph{Self-Proving models} that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. A Self-Proving model guarantees that, with high probability over a random input, it generates a correct output \emph{and} successfully proves its correctness to $V\!$. The \emph{soundness} property of $V$ guarantees that, for \emph{every} input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves the correctness of most of its outputs, while \emph{all} incorrect outputs (of any model) are detected by $V$. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions. The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD \emph{and} proves the correctness of its answer.
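To make the verification idea concrete, the sketch below shows one classical certificate scheme for GCD correctness: Bézout coefficients. This is an illustrative assumption, not necessarily the protocol used in the paper; the function names `extended_gcd` and `verify_gcd_certificate` are hypothetical. The key point mirrors soundness: no prover can certify an incorrect value $d$, because $as + bt$ is always a multiple of $\gcd(a, b)$.

```python
def extended_gcd(a: int, b: int):
    """Honest prover: return (d, s, t) with d = gcd(a, b) = a*s + b*t."""
    if b == 0:
        return a, 1, 0
    d, s, t = extended_gcd(b, a % b)
    # gcd(b, a % b) = b*s + (a - (a // b) * b) * t = a*t + b*(s - (a // b) * t)
    return d, t, s - (a // b) * t


def verify_gcd_certificate(a: int, b: int, d: int, s: int, t: int) -> bool:
    """Verifier V: accept iff (s, t) certifies that d = gcd(a, b).

    Soundness argument: if d divides both a and b, d is a common divisor;
    if additionally a*s + b*t == d, then every common divisor of a and b
    divides d, so d is the *greatest* common divisor. An incorrect d is
    always rejected, regardless of the claimed coefficients.
    """
    if d <= 0:
        return False
    return a % d == 0 and b % d == 0 and a * s + b * t == d
```

In the paper's setting, the model would play the prover's role, learning to emit both the answer and the convincing transcript; here the certificate is a single non-interactive message, which is a simplification of a general Interactive Proof.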