Repairing Regex Vulnerabilities via Localization-Guided Instructions

Regular expressions (regexes) are foundational to modern computing for critical tasks like input validation and data parsing, yet their ubiquity exposes systems to regular expression denial of service (ReDoS), a vulnerability requiring automated repair methods. Current approaches, however, are hampered by a trade-off. Symbolic, rule-based systems are precise but fail to repair unseen or complex vulnerability patterns. Conversely, large language models (LLMs) possess the necessary generalizability but are unreliable for tasks demanding strict syntactic and semantic correctness. We resolve this impasse by introducing a hybrid framework, localized regex repair (LRR), designed to harness LLM generalization while enforcing reliability. Our core insight is to decouple problem identification from the repair process. First, a deterministic, symbolic module localizes the precise vulnerable subpattern, creating a constrained and tractable problem space. Then, the LLM is invoked to generate a semantically equivalent fix for this isolated segment. This combined architecture successfully resolves complex repair cases intractable for rule-based repair while avoiding the semantic errors of LLM-only approaches. Our work provides a validated methodology for solving such problems in automated repair, improving the repair rate by 15.4 percentage points over the state-of-the-art. Our code is available at https://github.com/cdltlehf/LRR.

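The localize-then-repair split described in the abstract can be illustrated with a small Python sketch. This is not the LRR implementation: the names `localize`, `propose_fix`, `agree_on_samples`, and `repair` are hypothetical, the localizer is a toy string heuristic that only recognizes a quantified group wrapping a single quantified literal (such as `(a+)+`), and the LLM call is replaced by a hard-coded stub. The equivalence check is a bounded test on short strings, not a proof, and it compares only which strings are accepted, so capture groups are ignored.

```python
import itertools
import re

# Toy localizer: finds a group holding one quantified literal that is itself
# quantified, e.g. "(a+)+" or "(b*)+". Real localization would need a parser.
NESTED_QUANTIFIER = re.compile(r"\(([A-Za-z0-9])([+*])\)([+*])")


def localize(pattern):
    """Return (start, end) of the first suspicious subpattern, or None."""
    m = NESTED_QUANTIFIER.search(pattern)
    return m.span() if m else None


def propose_fix(subpattern):
    """Hypothetical stand-in for the LLM call: rewrite the isolated segment."""
    m = NESTED_QUANTIFIER.fullmatch(subpattern)
    atom, inner, outer = m.groups()
    # (x+)+ accepts the same strings as x+; every other combination of
    # quantifiers also accepts the empty string, so it collapses to x*.
    return atom + ("+" if inner == outer == "+" else "*")


def agree_on_samples(old, new, alphabet="ab", max_len=6):
    """Bounded check (not a proof) that both patterns accept the same strings."""
    old_re, new_re = re.compile(old), re.compile(new)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(old_re.fullmatch(s)) != bool(new_re.fullmatch(s)):
                return False
    return True


def repair(pattern):
    """Localize, rewrite only the isolated segment, and keep it if it checks out."""
    span = localize(pattern)
    if span is None:
        return pattern  # nothing this toy localizer recognizes
    start, end = span
    candidate = pattern[:start] + propose_fix(pattern[start:end]) + pattern[end:]
    return candidate if agree_on_samples(pattern, candidate) else pattern


print(repair("^(a+)+b$"))  # prints ^a+b$
```

The sample strings are kept short on purpose: the original pattern is the vulnerable one, so checking it against long inputs would trigger the very backtracking blow-up the repair is meant to remove.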