Critical open source software systems undergo significant validation in the form of lengthy fuzz campaigns. These campaigns typically conduct a biased random search over the domain of program inputs to find inputs that crash the software system. Such fuzzing enhances the security of software systems in general, since even closed source software may use open source components; hence testing open source software is of paramount importance. Currently, OSS-Fuzz is the most significant and widely used infrastructure for continuous validation of open source systems. Unfortunately, even though OSS-Fuzz has identified more than 10,000 vulnerabilities across more than 1,000 software projects, the detected vulnerabilities may remain unpatched, as vulnerability fixing is often manual in practice. In this work, we rely on recent progress in Large Language Model (LLM) agents for autonomous program improvement, including bug fixing. We customise the well-known AutoCodeRover agent for fixing security vulnerabilities. Such customisation is needed because LLM agents like AutoCodeRover fix bugs from issue descriptions via code search; for security patching, we instead rely on test execution of the exploit input to extract the code elements relevant to the fix. Our experience with OSS-Fuzz vulnerability data shows that LLM agent autonomy is useful for successful security patching, as opposed to approaches like Agentless where the control flow is fixed. More importantly, our findings show that patch quality cannot be measured by code similarity of the patch with reference code (as in the CodeBLEU scores used in VulMaster), since patches with high CodeBLEU scores still fail to pass the given exploit input. Our findings indicate that assessing security patch correctness needs to consider dynamic attributes like test execution, rather than relying on standard text/code similarity metrics.
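The dynamic validation criterion described above can be illustrated with a minimal sketch. This is not the paper's tooling: the two callables below are hypothetical stand-ins for an instrumented build of a target project before and after a candidate patch, and the validator simply re-runs the fuzzer-found exploit input and checks whether the program survives, rather than scoring the patch by textual similarity to a reference.

```python
def unpatched(data: bytes) -> int:
    # Simulated vulnerable code: crashes (raises) on a short input.
    return data[8]  # IndexError when the input is shorter than 9 bytes


def patched(data: bytes) -> int:
    # Simulated candidate patch: bounds check before the read.
    return data[8] if len(data) > 8 else 0


def passes_exploit(program, exploit_input: bytes) -> bool:
    """Return True iff the program survives the exploit input."""
    try:
        program(exploit_input)
        return True
    except Exception:
        # A crash on the exploit input means the vulnerability remains.
        return False


exploit = b"\x00" * 4  # crashing input previously found by the fuzz campaign

assert not passes_exploit(unpatched, exploit)  # vulnerability reproduces
assert passes_exploit(patched, exploit)        # candidate patch holds
```

A candidate patch with a high CodeBLEU score against the developer's reference fix would still be rejected by this check if it fails to stop the crash, which is the distinction the abstract draws.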