Static Application Security Testing (SAST) tools are integral to modern DevSecOps pipelines, yet tools such as CodeQL, Semgrep, and SonarQube remain fundamentally constrained: they require expert-crafted queries, generate excessive false positives, and detect only predefined vulnerability patterns. Recent work has explored augmenting SAST with Large Language Models (LLMs), but these approaches typically use LLMs to triage existing tool outputs rather than to reason about vulnerability semantics directly. We introduce QRS (Query, Review, Sanitize), a neuro-symbolic framework that inverts this paradigm. Rather than filtering results from static rules, QRS employs three autonomous agents that generate CodeQL queries from a structured schema definition and few-shot examples, then validate the resulting findings through semantic reasoning and automated exploit synthesis. This architecture enables QRS to discover vulnerability classes beyond predefined patterns while substantially reducing false positives. We evaluate QRS on full Python packages rather than isolated snippets. On 20 historical CVEs in popular PyPI libraries, QRS achieves 90.6% detection accuracy. Applied to the 100 most-downloaded PyPI packages, QRS identified 39 medium-to-high-severity vulnerabilities: 5 were assigned new CVEs, 5 led to documentation updates, and the remaining 29 were independently discovered by concurrent researchers, which corroborates both the severity and the discoverability of these findings. QRS accomplishes this with low time overhead and manageable token costs, demonstrating that LLM-driven query synthesis and code review can complement manually curated rule sets and uncover vulnerability patterns that evade existing industry tools.