VulInstruct: Teaching LLMs Root-Cause Reasoning for Vulnerability Detection via Security Specifications

Large language models (LLMs) have achieved remarkable progress in code understanding tasks, yet they show limited performance in vulnerability detection and struggle to distinguish vulnerable code from patched code. We argue that LLMs lack an understanding of security specifications: the expectations about how code should behave to remain safe. When code behavior deviates from these expectations, it becomes a potential vulnerability. Such knowledge is rarely explicit in training data, which leaves models unable to reason about security flaws. We propose VulInstruct, a specification-guided approach that systematically extracts security specifications from historical vulnerabilities and uses them to detect new ones. VulInstruct constructs a specification knowledge base from two perspectives: (i) general specifications from high-quality patches across projects, which capture fundamental safe behaviors; and (ii) domain-specific specifications from repeated violations in particular repositories relevant to the target code. At detection time, VulInstruct retrieves relevant past cases and specifications, enabling LLMs to reason about expected safe behaviors rather than relying on surface patterns. We evaluate VulInstruct under strict criteria that require both a correct prediction and valid reasoning. On PrimeVul, VulInstruct achieves a 45.0% F1-score (a 32.7% improvement) and 37.7% recall (a 50.8% improvement) compared to baselines, while uniquely detecting 24.3% of vulnerabilities, 2.4x more than any baseline. In pair-wise evaluation, VulInstruct achieves a 32.3% relative improvement. VulInstruct also discovered a previously unknown high-severity vulnerability (CVE-2025-56538) in production code, demonstrating practical value for real-world vulnerability discovery. All code and supplementary materials are available at https://github.com/zhuhaopku/VulInstruct-temp.
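
The retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Spec` record, the `retrieve` and `build_prompt` functions, the repository names, and the token-overlap (Jaccard) similarity are all hypothetical stand-ins chosen for illustration. The abstract does not say how VulInstruct ranks specifications or phrases its prompts. The sketch only shows the general idea of combining cross-project (general) specifications with those from the target repository (domain-specific) before asking an LLM to reason about expected safe behavior.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical knowledge-base entry (not the paper's schema)."""
    kind: str  # "general" (cross-project) or "domain" (repository-specific)
    repo: str  # source repository; "*" for general specifications
    text: str  # natural-language statement of the expected safe behavior


def tokens(s: str) -> set:
    """Lowercased identifier-like tokens; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z_]+", s.lower()))


def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity (assumed here; the paper's metric is not stated)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(code: str, repo: str, kb: list, k: int = 2) -> list:
    """Keep general specs plus domain specs from the target repo, then rank."""
    code_toks = tokens(code)
    candidates = [s for s in kb if s.kind == "general" or s.repo == repo]
    ranked = sorted(candidates,
                    key=lambda s: jaccard(code_toks, tokens(s.text)),
                    reverse=True)
    return ranked[:k]


def build_prompt(code: str, specs: list) -> str:
    """Assemble an illustrative prompt that puts specifications before the code."""
    spec_lines = "\n".join(f"- [{s.kind}] {s.text}" for s in specs)
    return ("Security specifications the code is expected to satisfy:\n"
            f"{spec_lines}\n\n"
            "Does the following code violate any of them? Explain the root cause.\n"
            f"{code}")


# Toy knowledge base; all entries are invented for illustration.
KB = [
    Spec("general", "*",
         "check that len does not exceed the size of the destination buffer before memcpy"),
    Spec("domain", "example/parser",
         "validate the header length field before reading the packet payload"),
    Spec("domain", "other/project", "memcpy len buffer"),  # different repo: filtered out
]

TARGET = "void copy(const char *src, size_t len) { char buffer[64]; memcpy(buffer, src, len); }"

TOP = retrieve(TARGET, "example/parser", KB)
print(build_prompt(TARGET, TOP))
```

In this toy setup the specification from `other/project` has the highest token overlap with the target code, but it is excluded because domain-specific specifications are taken only from the repository under analysis. A real system would presumably use a stronger similarity measure than token overlap.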
