Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software

Developers build modern software applications (Apps) on top of third-party libraries (Libs). When a library vulnerability is reachable through application code, the application can be exposed to software supply chain attacks. Prior work shows that developers often require concrete, executable evidence, i.e., proof-of-vulnerability (PoV) tests, to decide whether a reported dependency vulnerability poses a practical security risk to their application. However, manually crafting such tests is challenging, and existing tool support is insufficient to automate the procedure. To streamline test generation, we created PoVSmith, a new approach that combines call path analysis, exemplar tests, code context, and feedback into multiple prompts that guide a coding agent (i.e., Codex) and a large language model (i.e., GPT) through test generation, execution, and assessment. We evaluated PoVSmith on 33 $\langle$App, Lib$\rangle$ Java program pairs, where each App depends on a vulnerable Lib. PoVSmith reported 158 unique application-level entry points (i.e., public methods) that call vulnerable library APIs; 152 (96\%) of these were correctly identified, along with their call paths. Using this method call information, PoVSmith generated 152 tests, 84 (55\%) of which demonstrated feasible ways of attacking Apps by exploiting Lib vulnerabilities. PoVSmith substantially outperforms the state-of-the-art LLM-based approach: it reduces human involvement while markedly improving test quality. Our work contributes (1) a novel approach to agent-based test generation, (2) an iterative code refinement process driven by execution feedback, and (3) LLM-based quality assessment grounded in both the test context and execution logs.
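To make the call path analysis step more concrete, the following is a minimal sketch of how application-level entry points that reach a vulnerable library API might be located in a call graph. It is not PoVSmith's implementation: the function `find_entry_paths`, the dictionary-based call graph, and all method names (`App.parseConfig`, `Lib.Yaml.load`, and so on) are hypothetical and chosen only for illustration. The sketch assumes the call graph has already been extracted by some static analysis and uses a plain breadth-first search from each public method.

```python
from collections import deque


def find_entry_paths(call_graph, public_methods, vulnerable_apis):
    """Map each public method that can reach a vulnerable API to one shortest call path.

    call_graph: dict mapping a caller name to a list of callee names (assumed given).
    public_methods: set of application-level public method names.
    vulnerable_apis: set of vulnerable library API names.
    """
    paths = {}
    for entry in sorted(public_methods):
        queue = deque([[entry]])  # each queue item is a call path starting at entry
        seen = {entry}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in vulnerable_apis:
                paths[entry] = path  # first hit in BFS order is a shortest path
                break
            for callee in call_graph.get(node, ()):
                if callee not in seen:
                    seen.add(callee)
                    queue.append(path + [callee])
    return paths


# Hypothetical toy call graph: one public method reaches the vulnerable API, one does not.
CALL_GRAPH = {
    "App.parseConfig": ["App.loadYaml"],
    "App.loadYaml": ["Lib.Yaml.load"],
    "App.version": ["App.readManifest"],
    "App.readManifest": [],
}
PUBLIC_METHODS = {"App.parseConfig", "App.version"}
VULNERABLE_APIS = {"Lib.Yaml.load"}

entry_paths = find_entry_paths(CALL_GRAPH, PUBLIC_METHODS, VULNERABLE_APIS)
for entry, path in entry_paths.items():
    print(" -> ".join(path))
```

In this toy example only `App.parseConfig` is reported, together with the path through `App.loadYaml`. A path of this kind is the sort of method call information that could then be placed into a prompt alongside an exemplar test and code context.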
