Logical vulnerabilities in software stem from flaws in program logic rather than from memory-safety errors, and they can lead to critical security failures. Existing automated program repair techniques focus primarily on memory corruption vulnerabilities and struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. Recent successes of large language models (LLMs) in understanding and repairing code are promising, but no framework currently exists to analyze the capabilities and limitations of such techniques on logical vulnerabilities. This paper systematically evaluates both traditional and LLM-based repair approaches on real-world logical vulnerabilities. To facilitate our assessment, we created LogicDS, the first dataset of its kind, comprising 86 logical vulnerabilities with assigned CVEs that reflect tangible security impact. We also developed LogicEval, a systematic framework for evaluating patches for logical vulnerabilities. Our evaluations suggest that compilation and testing failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.