LLMs are increasingly pervasive in security operations, yet there are few measures of their effectiveness, which limits the trust and usefulness security practitioners can derive from them. Here, we present an open-source evaluation framework and benchmark metrics for assessing LLM-generated cybersecurity rules. The benchmark uses a holdout-set methodology to measure the effectiveness of LLM-generated security rules against a corpus of human-written rules. It provides three key metrics, inspired by the way experts evaluate security rules, that together offer a realistic, multifaceted assessment of an LLM-based security rule generator. We illustrate the methodology using rules written by Sublime Security's detection team and by Sublime Security's Automated Detection Engineer (ADE), and present a thorough analysis of ADE's capabilities in the results section.
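As a rough illustration of the holdout-based comparison, the sketch below scores a rule set against a labeled holdout set; the same procedure can be applied to both an LLM-generated rule set and a human-written corpus. This is a minimal sketch under stated assumptions, not the framework's actual API: `Sample`, `Rule`, and `evaluate_rules` are hypothetical names, and the two example metrics are illustrative stand-ins rather than the three benchmark metrics described in the paper.

```python
# Hypothetical sketch of a holdout-based rule evaluation loop.
# Names and metrics are illustrative only, not the framework's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    text: str           # raw message content
    is_malicious: bool  # ground-truth label

# A "rule" is modeled as a predicate over a sample.
Rule = Callable[[Sample], bool]

def evaluate_rules(rules: List[Rule], holdout: List[Sample]) -> dict:
    """Score a rule set against a labeled holdout set."""
    malicious = [s for s in holdout if s.is_malicious]
    benign = [s for s in holdout if not s.is_malicious]

    def flagged(s: Sample) -> bool:
        # A sample is flagged if any rule in the set matches it.
        return any(rule(s) for rule in rules)

    detected = sum(flagged(s) for s in malicious)
    false_positives = sum(flagged(s) for s in benign)
    return {
        "detection_rate": detected / len(malicious) if malicious else 0.0,
        "false_positive_rate": false_positives / len(benign) if benign else 0.0,
    }

# Usage: run evaluate_rules() once with the LLM-generated rule set and once
# with the human-written corpus on the same holdout, then compare the
# resulting metric dictionaries.
```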