An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

A real cybersecurity risk assessment is expensive for a small organization: a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system in which each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded; this mechanism is what distinguishes the system from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared its outputs to independent assessments by three CISSP practitioners. The system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles (healthcare, fintech, manufacturing, retail, and SaaS), comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both models completed every run. The fine-tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, and platform-specific risks in retail. The full multi-agent pipeline, however, failed on all 30 attempts on a Tesla T4 with its 4,096-token default context window: context capacity, not model quality, turned out to be the binding constraint.
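
The shared-context mechanism, and one way a growing context could run into a fixed window, can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The names `SharedContext`, `run_pipeline`, `make_stub_agent`, and `ContextOverflow` are hypothetical, the agents are stubs standing in for model calls, and token counts are approximated by whitespace splitting rather than a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# The six analytical stages named in the abstract, in pipeline order.
STAGES = [
    "organization_profile",
    "asset_mapping",
    "threat_analysis",
    "control_evaluation",
    "risk_scoring",
    "recommendations",
]


class ContextOverflow(RuntimeError):
    """Raised when the accumulated context no longer fits the window."""


@dataclass
class SharedContext:
    """Persistent context: every stage appends its conclusions here."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, stage: str, text: str) -> None:
        self.entries.append((stage, text))

    def render(self) -> str:
        # Later agents see everything earlier agents concluded.
        return "\n".join(f"[{stage}] {text}" for stage, text in self.entries)


def estimate_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude stand-in for a tokenizer.
    return len(text.split())


def run_pipeline(
    agents: Dict[str, Callable[[str], str]],
    org_description: str,
    context_limit: int = 4096,
) -> SharedContext:
    ctx = SharedContext()
    ctx.add("input", org_description)
    for stage in STAGES:
        prompt = ctx.render()
        used = estimate_tokens(prompt)
        if used > context_limit:
            raise ContextOverflow(
                f"{stage}: prompt needs ~{used} tokens, limit is {context_limit}"
            )
        ctx.add(stage, agents[stage](prompt))
    return ctx


def make_stub_agent(stage: str, n_words: int) -> Callable[[str], str]:
    # Hypothetical stand-in for an LLM call: emits n_words filler words.
    return lambda prompt: " ".join([stage] * n_words)


if __name__ == "__main__":
    agents = {s: make_stub_agent(s, 1000) for s in STAGES}
    try:
        run_pipeline(agents, "15-person healthcare company")
    except ContextOverflow as err:
        print("pipeline failed:", err)
```

With stub agents that each emit 1,000 words, the accumulated prompt exceeds the 4,096-token limit before the final stage runs, whereas short outputs let all six stages complete. The sketch only illustrates the accumulation effect; the actual output sizes and failure point of the real pipeline are not specified here.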
