An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

A real cybersecurity risk assessment is expensive for a small organization: a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system in which each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. The agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded; this mechanism is what distinguishes the system from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared its outputs to independent assessments by three CISSP practitioners. The system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles (healthcare, fintech, manufacturing, retail, and SaaS), comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both models completed every run. The fine-tuned model flagged threats the baseline missed entirely: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, and platform-specific risks in retail. The full multi-agent pipeline, however, failed all 30 attempts on a Tesla T4 with its 4,096-token default context window. Context capacity, not model quality, turned out to be the binding constraint.
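
The shared-context mechanism, and the reason it strains a fixed context window, can be illustrated with a minimal Python sketch. It is not the system's implementation. The stage names paraphrase the six stages listed above, while `SharedContext`, `run_pipeline`, `stub_agent`, the whitespace-based `estimate_tokens`, and the 512-token response reserve are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the six analytical stages described in the abstract.
STAGES = [
    "organization_profile",
    "asset_mapping",
    "threat_analysis",
    "control_evaluation",
    "risk_scoring",
    "recommendations",
]

CONTEXT_WINDOW = 4096  # tokens; the default window cited in the abstract


class ContextOverflow(RuntimeError):
    """Raised when the accumulated context no longer fits the window."""


def estimate_tokens(text: str) -> int:
    # Assumption: a crude stand-in for a real tokenizer (one token per word).
    return len(text.split())


@dataclass
class SharedContext:
    # Findings accumulate stage by stage and are never discarded.
    findings: dict = field(default_factory=dict)

    def render(self) -> str:
        return "\n".join(f"[{s}] {t}" for s, t in self.findings.items())


def run_pipeline(agent, window=CONTEXT_WINDOW, reserve=512):
    """Run each stage on everything earlier stages wrote.

    `reserve` is a hypothetical budget kept free for the model's response.
    Returns the final context and the prompt size seen by each stage.
    """
    ctx = SharedContext()
    prompt_sizes = []
    for stage in STAGES:
        prompt = ctx.render()
        size = estimate_tokens(prompt)
        if size + reserve > window:
            raise ContextOverflow(
                f"{stage}: {size} prompt tokens plus {reserve} reserved "
                f"exceed the {window}-token window"
            )
        prompt_sizes.append(size)
        ctx.findings[stage] = agent(stage, prompt)
    return ctx, prompt_sizes


def stub_agent(stage, prompt):
    # Hypothetical placeholder for a model call: returns 900 filler words.
    return " ".join(["finding"] * 900)
```

Because every stage receives the full accumulated context, the prompt grows roughly linearly with the number of completed stages. Calling `run_pipeline(stub_agent)` raises `ContextOverflow` before the later stages run, whereas an agent that returns short outputs completes all six. The 900-word output size is an arbitrary illustration, not a measurement from the experiments.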
