Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently reach correct detection verdicts via hallucinated logic or superficial patterns that deviate from the actual root cause. This misalignment remains largely obscured because contemporary benchmarks predominantly prioritize coarse-grained classification metrics and lack the granular ground truth required to evaluate the underlying reasoning process. To bridge this gap, we first construct a benchmark consisting of two datasets: (1) real-world vulnerabilities with expert-curated causal reasoning as ground truth, and (2) semantically equivalent code perturbations for assessing reasoning robustness. Our large-scale empirical study reveals that even state-of-the-art models struggle to maintain logical consistency during semantic code comprehension, exhibiting 12 systematic failure patterns. To address these limitations, we propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. Unlike linear chain-of-thought (CoT), our approach explicitly maps causal dependencies to enforce structural consistency. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning traces with program-intrinsic logic. Experimental results demonstrate that our framework improves the reasoning F1-score by an average of 18.9% over all baselines. Remarkably, our 8B-parameter implementation not only outperforms existing models of comparable scale but also surpasses specialized large-scale reasoning models, including Qwen3-30B-Reasoning and GPT-OSS-20B-High. It is even competitive with state-of-the-art models such as Claude-Sonnet-4.5 (75.47% vs. 76.11%), setting a new standard for efficient vulnerability reasoning across model scales.
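The abstract does not specify how a reasoning trace is represented or validated as a DAG; the sketch below is purely illustrative and not the paper's implementation. It assumes a hypothetical representation where each reasoning step is a node and each causal dependency is a directed edge, and uses Kahn's topological sort to check the structural-consistency property that linear CoT cannot enforce: every causal dependency must be acyclic. The step names (`taint_source`, `overflow`, etc.) are invented for the example.

```python
from collections import deque

def is_valid_reasoning_dag(nodes, edges):
    """Check that a candidate reasoning trace forms a DAG.

    nodes: list of reasoning-step identifiers (hypothetical example names).
    edges: list of (cause, effect) pairs, one per causal dependency.
    Returns (is_dag, topological_order); order is partial if a cycle exists.
    Uses Kahn's algorithm: repeatedly remove zero-in-degree nodes.
    """
    indegree = {n: 0 for n in nodes}
    adjacency = {n: [] for n in nodes}
    for cause, effect in edges:
        adjacency[cause].append(effect)
        indegree[effect] += 1

    # Start from steps with no unmet causal prerequisites.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        step = queue.popleft()
        order.append(step)
        for successor in adjacency[step]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                queue.append(successor)

    # If every node was emitted, no cycle exists: the trace is a DAG.
    return len(order) == len(nodes), order

# Illustrative trace for a buffer-overflow root cause (invented steps).
steps = ["taint_source", "missing_bounds_check", "buffer_write", "overflow"]
deps = [("taint_source", "buffer_write"),
        ("missing_bounds_check", "overflow"),
        ("buffer_write", "overflow")]
valid, order = is_valid_reasoning_dag(steps, deps)
```

A trace whose dependencies contain a cycle (e.g. a step justified by its own conclusion) would fail this check, which is one way a verifiable reward signal for RLVR could be grounded in the trace's structure.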