Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS repeatedly proposes fixes and refines them based on feedback from execution on the target system and from validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering. This points to a practical path toward reducing the burden of security compliance in environments where compute is limited or where security and privacy needs drive local model use.
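To make the iterative, feedback-driven idea concrete, the following is a minimal sketch of a propose, execute, validate loop. It is not the SHIELDS implementation: the names `Attempt`, `remediate`, `propose`, `execute`, and `validate` are hypothetical, the multi-agent structure is not modeled, and the toy stand-ins replace the LLM, the target system, and the compliance scanner so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One proposed fix plus the feedback it produced (hypothetical record)."""
    command: str
    output: str
    passed: bool


def remediate(
    finding: str,
    propose: Callable[[str, List[Attempt]], str],
    execute: Callable[[str], str],
    validate: Callable[[str], bool],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Propose, run, and re-scan until the finding clears or rounds run out."""
    history: List[Attempt] = []
    for _ in range(max_rounds):
        command = propose(finding, history)  # assumed: an LLM call that sees prior feedback
        output = execute(command)            # assumed: runs the fix on the target system
        passed = validate(finding)           # assumed: re-runs the validation scan
        history.append(Attempt(command, output, passed))
        if passed:
            break
    return history


# Toy stand-ins so the sketch runs without an LLM, a VM, or a scanner.
state = {"PermitRootLogin": "yes"}


def toy_propose(finding: str, history: List[Attempt]) -> str:
    # The first guess uses a wrong option name; later guesses correct it.
    if not history:
        return "set RootLogin no"
    return "set PermitRootLogin no"


def toy_execute(command: str) -> str:
    _, key, value = command.split()
    if key not in state:
        return f"error: unknown option {key}"
    state[key] = value
    return "ok"


def toy_validate(finding: str) -> bool:
    return state["PermitRootLogin"] == "no"


attempts = remediate("root-login-disabled", toy_propose, toy_execute, toy_validate)
for a in attempts:
    print(a.command, "|", a.output, "|", "pass" if a.passed else "fail")
```

In this toy run the first proposal fails, and the second one clears the finding. A static, pre-written remediation corresponds to calling `propose` once and ignoring `history`.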