ATTAIN: Automated Exploit Failure Analysis through Trace-Driven Diff Analysis

Exploits are widely used to check whether a library vulnerability is present in different versions and to mark the affected version range. Exploit-based checks can fail, however, because an exploit often stops running on many versions after API or environment changes, for reasons unrelated to the vulnerability itself. Commit-based methods, such as SZZ-style analysis, can miss the true vulnerability-introducing commits and propagate incorrect labels along long version chains. As a result, many affected versions are left unlabeled or mislabeled, and manual analysis of exploit failures is too expensive to be practical at scale. We present ATTAIN, a trace-driven diff analysis framework that assesses vulnerability presence across evolving library versions through three modules: trace construction, diff exploration, and affected-version judgment. The trace construction module executes an exploit across historical library versions and compares the resulting behaviors to capture cross-version execution divergences. Using these divergences, the diff exploration module guides an LLM through a finite-state tool loop in which it autonomously searches the changes between versions and collects vulnerability-relevant diff hunks. The affected-version judgment module reasons over the collected evidence to determine whether the vulnerability exists in each version and outputs the affected version range. We evaluate ATTAIN on a dataset of 224 CVEs and 25,943 library versions across 128 libraries. ATTAIN achieves an F1-score of 93.24%, outperforming the commit-based methods V-SZZ and LLM4SZZ by 116.28% and 33.30%, respectively. Because ATTAIN uses short tool-guided prompts and a fixed number of iterations, its token usage stays low. It matches or surpasses existing methods on frequent CWE types, including cases where exploit runs fail for non-vulnerability reasons or commit messages do not clearly delimit the affected versions.
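
To make the idea of a cross-version execution divergence concrete, the following is a minimal sketch rather than ATTAIN's actual implementation. It assumes that each exploit run is recorded as an ordered list of event strings (for example, function names), which the abstract does not specify. The function name `first_divergence` and the sample traces are hypothetical.

```python
from itertools import zip_longest
from typing import Optional, Sequence, Tuple

# One divergence record: (event index, event in reference trace, event in other trace).
Divergence = Tuple[int, Optional[str], Optional[str]]


def first_divergence(reference: Sequence[str], other: Sequence[str]) -> Optional[Divergence]:
    """Return the first position where two execution traces differ.

    Hypothetical helper: traces are assumed to be ordered lists of event
    strings. A missing event (one trace is shorter) is reported as None.
    Returns None when the two traces are identical.
    """
    for index, (ref_event, other_event) in enumerate(zip_longest(reference, other)):
        if ref_event != other_event:
            return index, ref_event, other_event
    return None


# Hypothetical traces: the exploit triggers the bug on one version,
# while another version takes a different path and raises an error instead.
vulnerable_trace = ["parse_header", "read_chunk", "memcpy_unchecked", "crash"]
other_trace = ["parse_header", "read_chunk", "check_length", "raise ValueError"]

print(first_divergence(vulnerable_trace, other_trace))
# → (2, 'memcpy_unchecked', 'check_length')
```

In a pipeline like the one described, a divergence point of this kind would serve only as a starting hint for deciding which version changes to inspect. Whether the divergence reflects a fix or an unrelated API change still has to be judged from the diff hunks themselves.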
