Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work, we consider how LLMs may be leveraged to automatically repair security-relevant bugs present in hardware designs. We focus on bug repair in code written in the hardware description language Verilog. For this study, we build a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identification of the best parameters for the LLM. We show that an ensemble of LLMs can repair all ten of our benchmarks. This ensemble outperforms the state-of-the-art CirFix hardware bug repair tool on its own suite of bugs. These results show that LLMs can repair hardware security bugs, and that our framework is an important step toward the ultimate goal of automated end-to-end bug repair.