Leading commercial endpoint detection and response (EDR) products have shifted from operator-configured rule sets to multi-component systems in which autonomous AI components operate alongside, and increasingly in place of, operator-deployed policies. Autonomous defense agents that use commercial EDR as their hardening tool are therefore no longer tuning a passive tool; they are tuning a black-box autonomous system that makes its own vendor-specific decisions. We present the first evaluation framework for autonomous defense agents hardening commercial EDR. We instantiate it in a Game of Active Directory (GOAD) lab with Horizon3.ai's NodeZero as the autonomous pentester and Microsoft Defender XDR as the EDR. We run a sample benchmark of defense agents built on two large language model (LLM) backbones (Claude Sonnet 4.6 and Cisco Foundation-Sec-8B). We report three lessons learned that neither simulation nor open-source-EDR evaluation can surface: (i) commercial EDR telemetry is engineered for Security Operations Center (SOC) analyst workflows rather than for scientific benchmarking; (ii) per-policy attribution is needed to separate defense agent actions from autonomous EDR actions; and (iii) the EDR's autonomous behavior varies during the evaluation window. Together, these findings highlight a sim-to-real gap for enterprise defense and motivate evaluation methodologies for benchmarking autonomous defense agents in environments with black-box, autonomous tools.