Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabilities in real-world repositories requires reasoning over complex contextual interactions. Existing LLM-based VD approaches remain limited for two reasons: current datasets lack complete contextual information and high-quality reasoning supervision, and existing optimization methods rely primarily on coarse, outcome-centric supervision signals that fail to model the vulnerability reasoning process. To address these limitations, we first construct ContextVul, a new dataset that augments high-quality function-level vulnerability benchmarks with repository-level contextual information and curated vulnerability reasoning traces. Building on ContextVul, we introduce a two-stage optimization framework consisting of lightweight cold-start supervised fine-tuning followed by vulnerability-adaptive on-policy optimization (VULPO). VULPO incorporates multidimensional rewards that jointly evaluate vulnerability identification, vulnerability-relevant localization, and causal reasoning quality, along with difficulty-adaptive reward scaling to mitigate reward hacking and improve the effectiveness of reinforcement learning (RL). Extensive experiments demonstrate the superiority of VULPO for context-aware VD. Our VULPO-4B, the first specialized vulnerability reasoning LLM, substantially outperforms existing VD baselines, improving Pairwise Pass@1 by 203% relative to Qwen3-4B and achieving performance competitive with DeepSeek-V3.1, an LLM roughly 150x larger.
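To make the reward design concrete, the following is a minimal sketch of how a multidimensional reward with difficulty-adaptive scaling could be combined. It is not the VULPO implementation: the weights, the gating of localization and reasoning credit on a correct verdict, the linear scaling rule, and every name (`RewardParts`, `combined_reward`, `difficulty_scale`, `scaled_reward`) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Per-response scores in [0, 1]. Field names are hypothetical."""
    identification: float  # was the vulnerable/benign verdict correct?
    localization: float    # overlap with the vulnerability-relevant code
    reasoning: float       # judged quality of the causal explanation


def combined_reward(parts: RewardParts,
                    weights: tuple = (0.5, 0.25, 0.25)) -> float:
    """Weighted sum of the three dimensions (weights are assumed).

    Assumption: localization and reasoning earn credit only in proportion
    to the identification score, so a fluent explanation attached to a
    wrong verdict is not rewarded (one simple guard against reward hacking).
    """
    w_id, w_loc, w_reason = weights
    gated = w_loc * parts.localization + w_reason * parts.reasoning
    return w_id * parts.identification + parts.identification * gated


def difficulty_scale(group_pass_rate: float,
                     floor: float = 0.5, ceil: float = 1.5) -> float:
    """Assumed linear rule: samples the policy rarely solves get a larger scale."""
    if not 0.0 <= group_pass_rate <= 1.0:
        raise ValueError("group_pass_rate must lie in [0, 1]")
    return ceil - (ceil - floor) * group_pass_rate


def scaled_reward(parts: RewardParts, group_pass_rate: float) -> float:
    """Final scalar reward: combined score times the difficulty scale."""
    return combined_reward(parts) * difficulty_scale(group_pass_rate)
```

Under these assumed settings, a response with a correct verdict and partial localization and reasoning scores receives a larger reward on a sample that the policy rarely solves than on one it almost always solves, while a response with a wrong verdict receives zero regardless of how plausible its explanation looks.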