TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
翻译:TLA+是一种用于验证分布式系统与安全关键协议的正式规约语言。大型语言模型生成的TLA+规约常因语义原因无法通过TLC模型检查器。在25个LLM中,最佳公开基线的语法解析成功率为26.6%,语义模型检查成功率为8.6%。我们提出TLA-Prover——一个200亿参数的TLA+规约综合模型。训练过程结合了基于已验证示例的监督微调与基于修复的组相对策略优化。在GRPO阶段,模型学习修正自身被拒绝的规约。我们还从同一SFT检查点训练了直接偏好优化变体作为消融实验。TLC直接提供奖励信号,无需学习奖励模型。每个输出按四个等级分级:青铜(通过解析)、白银(无警告)、黄金(通过TLC)和钻石。要达到钻石等级,模型的正确性属性需被自动小幅度修改;TLC必须检测到违规行为。若TLC仍通过,该属性为恒真且无贡献,输出不能达到钻石等级。在保留的30问题基准测试中,TLA-Prover在黄金和钻石等级均达到9/30(即pass@1=30%),约为未调优基线8.6%的3.5倍。DPO变体在钻石等级达到20%。所有检查点的黄金与钻石等级一致,这防止了平凡属性失败模式。