Large language models (LLMs) are increasingly being integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM-generated code but have paid limited attention to its security. However, LLM-generated code that appears functionally sound may embed security flaws that can cause catastrophic damage after deployment. This critical research gap motivates us to design a benchmark for assessing security awareness under realistic specifications. In this work, we introduce HardSecBench, a benchmark of 924 tasks spanning Verilog Register Transfer Level (RTL) and firmware-level C code, covering 76 hardware-relevant Common Weakness Enumeration (CWE) entries. Each task includes a structured specification, a secure reference implementation, and executable tests. To automate artifact synthesis, we propose a multi-agent pipeline that decouples synthesis from verification and grounds evaluation in execution evidence, enabling reliable assessment. Using HardSecBench, we evaluate a range of LLMs on hardware and firmware code generation and find that models often satisfy functional requirements while still introducing security risks. We also find that security outcomes vary with the prompting strategy. These findings highlight pressing challenges and offer actionable insights for future advances in LLM-assisted hardware design. Our data and code will be released soon.
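To make the "functionally sound yet insecure" distinction concrete, the following is a minimal illustrative sketch (not taken from HardSecBench itself; the function names and the fixed-size buffer are hypothetical). It shows a firmware-style C routine that passes functional tests on well-formed inputs but omits a bounds check (CWE-787, out-of-bounds write), alongside a secure reference variant of the kind each benchmark task pairs with its specification:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CFG_BUF_LEN 16

/* Fixed on-chip configuration buffer (hypothetical). */
static unsigned char cfg_buf[CFG_BUF_LEN];

/* Insecure variant: functionally correct whenever len <= CFG_BUF_LEN,
 * but silently corrupts adjacent memory for oversized inputs
 * (CWE-787: Out-of-bounds Write). */
void load_cfg_insecure(const unsigned char *src, size_t len) {
    memcpy(cfg_buf, src, len);
}

/* Secure reference variant: validates the length before writing,
 * rejecting out-of-range requests instead of overflowing. */
bool load_cfg_secure(const unsigned char *src, size_t len) {
    if (src == NULL || len > CFG_BUF_LEN)
        return false;  /* reject out-of-range or null input */
    memcpy(cfg_buf, src, len);
    return true;
}
```

Both variants behave identically on in-range inputs, which is why purely functional test suites accept the insecure one; only a security-aware test (e.g., probing an oversized `len`) separates them.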