The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, spanning from cold-start supervised fine-tuning (SFT) to off-policy preference optimization and on-policy reinforcement learning (RL), uncovering how data curation, stage interactions, reward mechanisms, and evaluation protocols collectively dictate the efficacy of model training and assessment. Our study identifies practical guidelines and insights: (1) SFT based on rejection sampling greatly outperforms rationalization-based supervision, which can introduce hallucinations due to ground-truth leakage. (2) While additional SFT epochs consistently benefit preference optimization, excessive SFT inhibits self-exploration during RL, ultimately limiting performance gains. (3) Coarse-grained reward signals often mislead RL, whereas fine-grained root-cause judgments ensure reliable credit assignment. Specification-based rewards offer further benefits but incur significant effort in specification generation. (4) Although filtering extremely hard-to-detect vulnerability samples improves RL training efficiency, the accompanying performance loss must be weighed in practical applications. (5) Models trained under GRPO significantly outperform those using SFT and preference optimization (i.e., DPO and ORPO), as well as a series of zero-shot SOTA LLMs, underscoring the significant potential of on-policy RL for LLM-based VD. (6) In contrast to binary matching, which tends to overestimate performance, LLM-as-a-Judge based on root-cause analysis provides a more robust evaluation protocol, although its accuracy varies across judge models with different levels of security expertise.
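The rejection-sampling curation in finding (1) can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: `generate_analysis` is a hypothetical stand-in for sampling one analysis from the base model, and the data records are toy examples. The key property is that the ground-truth label never appears in the prompt; a candidate rationale is kept only when the model's own verdict happens to agree with the label, so accepted rationales cannot leak it.

```python
import random

random.seed(0)  # for reproducibility of this toy sketch

def generate_analysis(code_snippet):
    """Hypothetical stand-in for sampling one chain-of-thought analysis
    from the base model. Returns (rationale_text, predicted_label).
    Note the ground-truth label is NOT an input."""
    label = random.choice(["vulnerable", "safe"])
    return f"Analysis of `{code_snippet}` -> {label}", label

def rejection_sample(dataset, n_samples=4):
    """Keep a (snippet, rationale) pair only when the model's own verdict
    matches the ground truth, discarding the rest (the 'rejection')."""
    curated = []
    for record in dataset:
        for _ in range(n_samples):
            rationale, predicted = generate_analysis(record["code"])
            if predicted == record["label"]:
                curated.append({"code": record["code"],
                                "rationale": rationale})
                break  # one accepted sample per snippet suffices here
    return curated

# Toy dataset; real VD data would be full functions with CWE labels.
data = [
    {"code": "strcpy(buf, user_input)", "label": "vulnerable"},
    {"code": "strncpy(buf, s, sizeof(buf) - 1)", "label": "safe"},
]
sft_data = rejection_sample(data)
```

By contrast, rationalization-based supervision would pass `record["label"]` into the generation prompt and ask the model to justify it, which is exactly the ground-truth leakage the abstract warns against.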