Security teams face a scaling problem: newly disclosed Common Vulnerabilities and Exposures (CVEs) arrive far faster than detection mechanisms can be developed by hand. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, which motivates automation. We present RuleForge, an AWS internal system that automatically generates detection rules (JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities) from structured Nuclei templates describing CVE details. Nuclei templates are standardized, YAML-based vulnerability descriptions that serve as the structured input to our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (large language model as a judge) confidence validation system and our systematic feedback integration mechanism. The validation approach evaluates candidate rules along two dimensions, sensitivity (avoiding false negatives) and specificity (avoiding false positives), achieving an AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates, each with up to five refinement attempts), combined with continuous feedback loops, enables systematic quality improvement. We also present extensions that enable rule generation from unstructured data sources, and we demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and the quality review of generated rules through human-in-the-loop validation.
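The 5x5 generation strategy described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than RuleForge's actual implementation: the `generate` and `judge` callables, the `Candidate` type, the acceptance `threshold`, and the choice to rank a rule by the weaker of its two judge scores are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (not RuleForge APIs):
#   generate(template, feedback) -> rule text
#   judge(rule) -> (sensitivity, specificity, feedback for the next attempt)
Generate = Callable[[str, Optional[str]], str]
Judge = Callable[[str], Tuple[float, float, str]]


@dataclass
class Candidate:
    rule: str
    sensitivity: float  # higher means fewer expected false negatives
    specificity: float  # higher means fewer expected false positives

    @property
    def score(self) -> float:
        # Assumption: a rule is only as good as its weaker dimension.
        return min(self.sensitivity, self.specificity)


def refine_one(template: str, generate: Generate, judge: Judge,
               max_attempts: int = 5, threshold: float = 0.8) -> Candidate:
    """Refine a single candidate for up to max_attempts rounds (max_attempts >= 1)."""
    best: Optional[Candidate] = None
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        rule = generate(template, feedback)
        sensitivity, specificity, feedback = judge(rule)
        candidate = Candidate(rule, sensitivity, specificity)
        if best is None or candidate.score > best.score:
            best = candidate
        if candidate.score >= threshold:
            break  # good enough; stop refining early
    assert best is not None
    return best


def generate_5x5(template: str, generate: Generate, judge: Judge,
                 n_candidates: int = 5, max_attempts: int = 5,
                 threshold: float = 0.8) -> Candidate:
    """Run n_candidates refinement loops in parallel and keep the best result."""
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        futures = [
            pool.submit(refine_one, template, generate, judge,
                        max_attempts, threshold)
            for _ in range(n_candidates)
        ]
        results = [f.result() for f in futures]
    return max(results, key=lambda c: c.score)
```

In this sketch each candidate carries its own feedback from one attempt to the next, so the five loops stay independent and can run in parallel. A real system would presumably also run synthetic tests and route low-confidence rules to human review, which the sketch omits.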