The security of scripting languages such as PowerShell is critical given their powerful automation and administration capabilities, which are often exercised with elevated privileges. Today, securing these languages still demands substantial human effort to craft and enforce rules, imposing a heavy burden on typical administrators and creating critical production risks (e.g., misoperations that shut down servers). Large language models (LLMs) have demonstrated strong capabilities in code generation, vulnerability detection, and automated repair for languages such as Python and JavaScript, but their ability to assist with generating secure scripting-language code remains largely underexplored. In this paper, we present SecGenEval-PS, a benchmark designed to systematically evaluate LLMs on secure script generation, security analysis, and automated repair. Our results show that both proprietary and open-source models fall short in these areas: for instance, over 60% of the PowerShell scripts produced by GPT-4o and o3-mini are insecure without structured guidance. To bridge this gap, we propose \textit{PSSec}, a framework that combines data synthesis with fine-tuning to strengthen model security capabilities. We develop a self-debugging agent that integrates static analyzers with the reasoning abilities of advanced LLMs to synthesize large-scale structured triplets of insecure scripts, violation analyses, and corresponding repairs. We then fine-tune lightweight LLMs (as small as 1.7B parameters) with supervised fine-tuning (SFT) and reinforcement learning (RL), enabling security-aware reasoning and the generation of secure PowerShell code. Across multiple LLM families, including GPT and Qwen, \textit{PSSec}-trained models match or surpass general-purpose large models on PowerShell security tasks while reducing inference cost by more than an order of magnitude.
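The self-debugging agent described above can be pictured as an analyze-repair-recheck loop that terminates when the static analyzer reports no violations, emitting an (insecure script, violation analysis, repaired script) triplet. The sketch below is a minimal, hypothetical illustration of that control flow only: `analyzer_findings` and `llm_repair` are stand-in stubs, where the real pipeline would respectively invoke a static analyzer such as PSScriptAnalyzer and an LLM API; the rule names mirror two actual PSScriptAnalyzer rules for concreteness.

```python
# Hypothetical sketch of the self-debugging triplet-synthesis loop.
# Both helpers below are stubs standing in for real components.

def analyzer_findings(script: str) -> list[str]:
    """Stub static analyzer: flags two common PowerShell security issues."""
    findings = []
    if "ConvertTo-SecureString" in script and "-AsPlainText" in script:
        findings.append("AvoidUsingConvertToSecureStringWithPlainText")
    if "Invoke-Expression" in script:
        findings.append("AvoidUsingInvokeExpression")
    return findings

def llm_repair(script: str, findings: list[str]) -> tuple[str, str]:
    """Stub LLM call: returns (violation analysis, candidate repaired script)."""
    analysis = "; ".join(findings)
    # A real LLM would rewrite the script; here we apply a canned fix.
    repaired = script.replace("Invoke-Expression $cmd", "& $cmd")
    return analysis, repaired

def synthesize_triplet(insecure_script: str, max_rounds: int = 3):
    """Self-debugging loop: analyze, repair, and re-check until clean."""
    script, analysis = insecure_script, ""
    for _ in range(max_rounds):
        findings = analyzer_findings(script)
        if not findings:
            break  # analyzer is satisfied; repair accepted
        analysis, script = llm_repair(script, findings)
    # Structured triplet: (insecure script, violation analysis, repair)
    return insecure_script, analysis, script
```

The key design point the loop captures is that the analyzer, not the LLM, decides when a repair is accepted, so every synthesized triplet is grounded in an analyzer-verified fix.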