Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

Although Large Language Models (LLMs) excel at code generation, they remain prone to reproducing subtle yet critical vulnerabilities that are endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically optimize at the coarse granularity of whole sequences. This often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that simply maximize likelihood, TSP constructs a decision tree in which the model explores branching trajectories, generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to discriminate against its own localized errors. This provides a dense, on-policy learning signal that drives self-correction at the decision nodes where vulnerabilities typically emerge. Our experiments show that TSP substantially improves model reliability. On Python security benchmarks, TSP raises CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP also generalizes out of distribution: the model reduces vulnerabilities in unseen categories (CWEs) by 24.5% and transfers security principles learned from C/C++ to other languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches but internalizes abstract, language-agnostic security logic.
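
To make the idea of a "critical decision node" concrete, the sketch below shows one way a secure golden path and its vulnerable variants could be turned into localized preference pairs. It is only an illustration under stated assumptions: the abstract does not specify how TSP builds its tree or which loss it uses, so the names `BranchPair`, `divergence_index`, and `branch_pairs`, the token lists, and the rule "branch at the first differing token" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BranchPair:
    """One decision node: a shared prefix and two competing continuations.

    Hypothetical structure for illustration; not taken from the paper.
    """
    prefix: Tuple[str, ...]
    chosen: Tuple[str, ...]    # continuation along the secure "golden path"
    rejected: Tuple[str, ...]  # continuation along the vulnerable variant


def divergence_index(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the index of the first position where the two sequences differ."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def branch_pairs(golden: Sequence[str],
                 variants: List[Sequence[str]]) -> List[BranchPair]:
    """Build one preference pair per variant, split at its divergence point."""
    pairs = []
    for variant in variants:
        i = divergence_index(golden, variant)
        if i == len(golden) and i == len(variant):
            continue  # identical to the golden path, so there is no decision node
        pairs.append(
            BranchPair(tuple(golden[:i]), tuple(golden[i:]), tuple(variant[i:]))
        )
    return pairs


# Assumed toy example: a parameterized SQL call versus string interpolation.
golden = ["cursor.execute(", "query,", "(user_id,)", ")"]
variants = [["cursor.execute(", "query % user_id", ")"]]
pairs = branch_pairs(golden, variants)
print(pairs[0].prefix)    # shared context up to the decision node
print(pairs[0].chosen)    # secure continuation
print(pairs[0].rejected)  # vulnerable continuation
```

Each pair shares its prefix, so a preference-style objective applied to it would compare only the tokens after the branch point. That is the sense in which the learning signal is localized rather than spread over the whole sequence.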
