Large Language Models (LLMs) are increasingly used in software development to generate functions, such as attack detectors, that implement security requirements. However, LLMs struggle to generate accurate code, resulting, for example, in attack detectors that miss well-known attacks when used in practice. This is most likely because the LLM lacks knowledge about some existing attacks and because the generated code is not evaluated in real usage scenarios. We propose a novel approach that integrates Retrieval Augmented Generation (RAG) and Self-Ranking into the LLM pipeline. RAG enhances the robustness of the output by incorporating external knowledge sources, while the Self-Ranking technique, inspired by the concept of Self-Consistency, generates multiple reasoning paths and ranks the results to select the most robust detector. Our extensive empirical study targets code generated by LLMs to detect two prevalent injection attacks in web security: Cross-Site Scripting (XSS) and SQL injection (SQLi). Results show a significant improvement in detection performance compared to baselines, with increases in F2-Score of up to 71 percentage points for XSS detection and up to 37 percentage points for SQLi detection.
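To make the selection step concrete, the following is a minimal sketch of ranking candidate detectors by F2-Score, which weights recall more heavily than precision: $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$ with $\beta = 2$. It is an illustration under stated assumptions, not the paper's implementation: the two candidate detectors, the labelled samples, and all function names are hypothetical, and the abstract does not specify how Self-Ranking actually scores candidates.

```python
import re
from typing import Callable, Dict, List, Tuple

# A detector takes an input string and returns True if it flags an attack.
Detector = Callable[[str], bool]


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts; beta=2 favours recall over precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def score(detector: Detector, samples: List[Tuple[str, bool]]) -> float:
    """Run a detector over labelled samples and return its F2-Score."""
    tp = fp = fn = 0
    for payload, is_attack in samples:
        flagged = detector(payload)
        if flagged and is_attack:
            tp += 1
        elif flagged and not is_attack:
            fp += 1
        elif not flagged and is_attack:
            fn += 1
    return f_beta(tp, fp, fn)


def rank_detectors(
    candidates: Dict[str, Detector], samples: List[Tuple[str, bool]]
) -> List[Tuple[str, float]]:
    """Return (name, F2-Score) pairs sorted from best to worst."""
    scored = [(name, score(det, samples)) for name, det in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates standing in for LLM-generated XSS detectors.
_BROAD = re.compile(r"<script|on\w+\s*=|javascript:", re.IGNORECASE)

CANDIDATES: Dict[str, Detector] = {
    # Only looks for script tags, so it misses event-handler payloads.
    "candidate_a": lambda s: "<script" in s.lower(),
    # Also checks inline event handlers and javascript: URLs.
    "candidate_b": lambda s: bool(_BROAD.search(s)),
}

# Hypothetical labelled samples: (input, is_attack).
SAMPLES: List[Tuple[str, bool]] = [
    ("<script>alert(1)</script>", True),
    ("<img src=x onerror=alert(1)>", True),
    ('<a href="javascript:alert(1)">x</a>', True),
    ("Hello, world", False),
    ("1 < 2 and 3 > 2", False),
]

ranking = rank_detectors(CANDIDATES, SAMPLES)
best_name, best_score = ranking[0]
print(ranking)
```

On this toy data the broader detector ranks first because it misses no attacks, while the narrower one is penalised for its two false negatives. A realistic evaluation would need far larger and more varied attack and benign sets.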