Identifying the vulnerability-fixing commits that correspond to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, because large repositories contain millions of commits, of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)-based methods, often suffer from poor precision-recall trade-offs. We find that these approaches are frequently evaluated on randomly sampled commits, a setup that substantially underestimates real-world difficulty, where candidate commits are already security-relevant and highly similar to one another. We propose Favia, a forensic, agent-based framework for vulnerability-fix identification that combines scalable candidate ranking with deep, iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each remaining candidate commit is then rigorously evaluated by a ReAct-based LLM agent. Given the pre-commit repository as its environment, along with specialized tools, the agent attempts to localize vulnerable components, navigate the codebase, and establish causal alignment between the code changes and the vulnerability's root cause. This evidence-driven process enables robust identification of indirect, multi-file, and non-trivial fixes that elude single-pass or similarity-based methods. We evaluate Favia on CVEVC, a large-scale dataset we constructed that comprises over 8 million commits from 3,708 real-world repositories, and show that it consistently outperforms state-of-the-art traditional and LLM-based baselines under realistic candidate selection, achieving the strongest precision-recall trade-offs and the highest F1-scores.
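The two-stage structure described above (a cheap ranking pass followed by an expensive per-commit check) can be illustrated with a minimal Python sketch. This is not Favia's implementation: the abstract does not specify the ranking function or the agent interface, so the Jaccard token-overlap score, the `Commit` type, and the `verify` callable (a stand-in for the ReAct-based agent) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Commit:
    """Hypothetical minimal commit record used only for this sketch."""
    sha: str
    message: str
    diff: str


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into identifier-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def rank_candidates(cve_description: str, commits: List[Commit], top_k: int) -> List[Commit]:
    """Stage 1 (assumed scoring): keep the top_k commits by Jaccard token overlap
    between the CVE description and the commit message plus diff."""
    query = tokenize(cve_description)

    def score(commit: Commit) -> float:
        tokens = tokenize(commit.message + " " + commit.diff)
        union = query | tokens
        return len(query & tokens) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original order.
    return sorted(commits, key=score, reverse=True)[:top_k]


def identify_fixes(
    cve_description: str,
    commits: List[Commit],
    verify: Callable[[str, Commit], bool],
    top_k: int = 10,
) -> List[Commit]:
    """Stage 2: run the expensive verifier only on the ranked shortlist."""
    shortlist = rank_candidates(cve_description, commits, top_k)
    return [c for c in shortlist if verify(cve_description, c)]


def keyword_verifier(cve_description: str, commit: Commit) -> bool:
    """Toy stand-in for the agent: accepts commits whose diff mentions 'bounds'.
    A real verifier would inspect the pre-commit repository with tools."""
    return "bounds" in commit.diff.lower()
```

The point of the split is cost: the ranking stage touches every commit but does only cheap string work, while the verifier, which in the described system is an LLM agent exploring the repository, runs on at most `top_k` candidates per CVE.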