PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy

Recent approaches leverage large language models (LLMs) and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits from vulnerability reports, but existing systems often suffer from two fundamental limitations: unreliable validation based on surface-level execution signals, and high operational cost caused by extensive trial and error during exploit generation. In this paper, we present PoC-Adapt, an end-to-end framework for automated PoC generation and verification built on semantic runtime validation and adaptive policy learning. At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states, which allows it to reliably distinguish true vulnerability exploitation from incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC-Adapt is implemented as a multi-agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experiments on the CWE-Bench-Java and PrimeVul benchmarks show that PoC-Adapt improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM-based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC-Adapt produced 12 verified PoCs out of 80 reproduction attempts, at a cost of $0.42 per generated exploit.
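To make the idea of validating an exploit by comparing structured pre- and post-execution states more concrete, the following is a minimal sketch in Python. It is not the PoC-Adapt implementation: the `SystemState` fields, the `diff_states` and `oracle_verdict` helpers, and the example paths are all hypothetical names chosen for illustration. The sketch assumes that a state can be summarized as sets of observable facts and that the expected effect of a successful exploit is known in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    """Hypothetical structured snapshot of the target environment."""
    files: frozenset = frozenset()      # paths present on disk
    processes: frozenset = frozenset()  # names of running processes
    db_rows: frozenset = frozenset()    # (table, row) pairs


def diff_states(pre, post):
    """Return the facts that appeared in each category after execution."""
    return {
        "files": post.files - pre.files,
        "processes": post.processes - pre.processes,
        "db_rows": post.db_rows - pre.db_rows,
    }


def oracle_verdict(pre, post, expected):
    """Accept only if every expected effect is present in the state delta.

    A change that is not among the expected effects (for example, a new
    log file) does not count as evidence of exploitation.
    """
    delta = diff_states(pre, post)
    return all(
        effects <= delta.get(category, frozenset())
        for category, effects in expected.items()
    )


# Illustrative data only: an arbitrary-file-write exploit is expected
# to create /tmp/pwned.
pre = SystemState(files=frozenset({"/app/config.yml"}))
post_exploit = SystemState(files=frozenset({"/app/config.yml", "/tmp/pwned"}))
post_noise = SystemState(files=frozenset({"/app/config.yml", "/app/logs/error.log"}))
expected = {"files": {"/tmp/pwned"}}

exploit_ok = oracle_verdict(pre, post_exploit, expected)
noise_ok = oracle_verdict(pre, post_noise, expected)
```

In this sketch, a run that only produces an error log changes the system state but is rejected, while a run that creates the expected artifact is accepted. A surface-level check such as "the process exited without error" could not make that distinction.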