While Large Language Models (LLMs) excel at code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that maximize sequence likelihood without regard to where a flaw arises, TSP constructs a decision tree in which the model explores branching trajectories, generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to discriminate against its own localized errors. This provides a dense, on-policy learning signal that drives self-correction at the critical decision nodes where vulnerabilities typically emerge. Our experiments show that TSP improves model reliability. On Python security benchmarks, TSP raises CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP also generalizes out of distribution: the model reduces vulnerabilities in unseen vulnerability categories (CWEs) by 24.5% and transfers security principles learned from C/C++ to other languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches but internalizes abstract, language-agnostic security logic.
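To make the idea of learning at branching decision nodes concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tree whose nodes hold code segments with a boolean security label, and it collects a (prefix, chosen, rejected) triple wherever sibling continuations disagree on that label. The names `Node`, `secure`, and `branch_pairs`, the labeling scheme, and the toy SQL example are all hypothetical; the abstract does not specify how the tree is built or which loss consumes the pairs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One decision point: a code segment plus its alternative continuations."""
    segment: str                 # code emitted at this step
    secure: bool = True          # hypothetical label for this continuation
    children: List["Node"] = field(default_factory=list)


def branch_pairs(root: Node) -> List[Tuple[str, str, str]]:
    """Collect (prefix, chosen, rejected) triples at every node whose
    children disagree on the security label."""
    pairs: List[Tuple[str, str, str]] = []
    stack = [(root, root.segment)]
    while stack:
        node, prefix = stack.pop()
        good = [c for c in node.children if c.secure]
        bad = [c for c in node.children if not c.secure]
        # Each secure/vulnerable sibling pair shares the same prefix, so the
        # preference signal is localized to this single decision.
        for g in good:
            for b in bad:
                pairs.append((prefix, g.segment, b.segment))
        for child in node.children:
            stack.append((child, prefix + child.segment))
    return pairs


# Toy tree (assumed example): two decision points in a database query.
root = Node("query = ", children=[
    Node('"SELECT * FROM users WHERE id = %s"\n', secure=True, children=[
        Node("cursor.execute(query, (user_id,))\n", secure=True),
        Node("cursor.execute(query % user_id)\n", secure=False),
    ]),
    Node('"SELECT * FROM users WHERE id = " + user_id\n', secure=False),
])

pairs = branch_pairs(root)
for prefix, chosen, rejected in pairs:
    print(repr(prefix), "| chosen:", repr(chosen), "| rejected:", repr(rejected))
```

Each triple could then feed a preference-style objective evaluated only on the diverging segment, which is one plausible way to obtain the dense, node-level signal described above. That choice of objective is an assumption rather than something stated in the abstract.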