As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve checkability of model outputs, but display a degradation in accuracy compared to a baseline trained only to maximize correctness -- a phenonemon named legibility tax. We propose a solution by decoupling the correctness from the checkability condition and instead training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game where the equilibria correspond to faithful and checkable translators.
翻译:随着大型语言模型能力日益增强,确保其输出能够被能力较低的系统轻松验证变得至关重要。证明者-验证者博弈可用于提升模型输出的可验证性,但与仅以最大化正确性为目标训练的基线模型相比,其准确性会出现下降——这一现象被称为可读性代价。我们提出一种解决方案:将正确性条件与可验证性条件解耦,转而训练一个"翻译器"模型,将固定求解器模型的解转换为可验证形式。这使得我们可以先训练求解器以最大化正确性,再训练翻译器将求解器的解转换为可验证形式,同时保留求解器的答案。为适应这种新的翻译目标,我们构建了一种解耦证明者-验证者博弈,其均衡态对应着忠实且可验证的翻译器。