Large Language Models (LLMs) have recently been applied to security vulnerability detection tasks, including the generation of proof-of-concept (PoC) exploits. A PoC exploit is a program that demonstrates how a vulnerability can be exploited. Several approaches suggest that supplying LLMs with additional guidance improves PoC generation outcomes, motivating further evaluation of their effectiveness. In this work, we develop PoC-Gym, a framework for LLM-based PoC generation for Java security vulnerabilities with systematic validation of the generated exploits. Using PoC-Gym, we evaluate whether guidance from static analysis tools improves the PoC generation success rate, and we manually inspect the resulting PoCs. Our results from running PoC-Gym with Claude Sonnet 4, GPT-5 Medium, and gpt-oss-20b show that using static analysis for guidance and success criteria leads to a 21% higher success rate than the prior baseline, FaultLine. However, manual inspection of both successful and failed PoCs reveals that 71.5% of the PoCs are invalid. These results show that the reported success of LLM-based PoC generation can be significantly misleading, and that such failures are hard to detect with current validation mechanisms.