The rapid evolution and adoption of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and of more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology for generating interpretable, natural language explanations of model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines on questions that require vendor-specific nuances or knowledge of formal standards such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.