The advancement of Large Language Models (LLMs) has raised concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state-of-the-art models, including OpenAI's GPT-5 suite and open-weight models, across a corpus of 81 domain-specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain-specific tool scaffold. Our empirical results show that agents reliably execute static structured-file analysis and single-tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the terminology of the IEC 61850 standard, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain-specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: https://github.com/GKeppler/CritBench