Beyond Random Inputs: A Novel ML-Based Hardware Fuzzing

Modern computing systems rely heavily on hardware as the root of trust. However, the increasing complexity of hardware has given rise to security-critical vulnerabilities that cross-layer attacks can exploit. Traditional hardware vulnerability detection methods, such as random regression and formal verification, have limitations. Random regression, while scalable, is slow in exploring hardware, and formal verification techniques are often hampered by manual effort and state explosion. Hardware fuzzing has emerged as an effective approach to exploring and detecting security vulnerabilities in large-scale designs like modern processors. Hardware fuzzers outperform traditional methods in coverage, scalability, and efficiency. However, state-of-the-art fuzzers struggle to achieve comprehensive coverage of intricate hardware designs within a practical timeframe, often falling short of a 70% coverage threshold. We propose a novel ML-based hardware fuzzer, ChatFuzz, to address this challenge. Our approach leverages large language models (LLMs) like ChatGPT to understand processor language, focusing on machine code and generating assembly code sequences. Reinforcement learning (RL) is integrated to guide the input generation process by rewarding inputs according to code coverage metrics. We use the open-source RISC-V-based RocketCore processor as our testbed. ChatFuzz achieves a condition coverage rate of 75% in just 52 minutes, whereas a state-of-the-art fuzzer requires a lengthy 30-hour window to reach similar condition coverage. Furthermore, our fuzzer can attain 80% coverage within a 130-hour window when provided with a limited pool of 10 simulation instances/licenses. During this time, it ran a total of 199K test cases, of which 6K produced discrepancies with the processor's golden model. Our analysis identified more than 10 unique mismatches, including two new bugs in the RocketCore and discrepancies from the RISC-V ISA Simulator.
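The two feedback signals mentioned in the abstract, a coverage-based reward and a comparison against a golden model, can be illustrated with a minimal Python sketch. The abstract does not give the reward formula or the trace format, so everything here is an assumption made for illustration: the names `coverage_reward` and `first_mismatch` are hypothetical, coverage points are modeled as plain integers, and the reward is taken to be the fraction of coverage points an input reaches for the first time.

```python
from typing import Iterable, Optional, Sequence, Set


def coverage_reward(covered_so_far: Set[int],
                    hit_by_input: Iterable[int],
                    total_points: int) -> float:
    """Reward an input by the share of coverage points it hits for the first time.

    Assumption: the reward is simply (newly covered points) / (all points).
    The set `covered_so_far` is updated in place to include the new points.
    """
    if total_points <= 0:
        raise ValueError("total_points must be positive")
    new_points = set(hit_by_input) - covered_so_far
    covered_so_far |= new_points  # accumulate coverage across inputs
    return len(new_points) / total_points


def first_mismatch(dut_trace: Sequence, golden_trace: Sequence) -> Optional[int]:
    """Return the index of the first differing trace entry, or None if the traces agree.

    Assumption: each trace is a sequence of comparable per-instruction records.
    A trace that ends early counts as a mismatch at the shorter length.
    """
    for index, (dut_entry, golden_entry) in enumerate(zip(dut_trace, golden_trace)):
        if dut_entry != golden_entry:
            return index
    if len(dut_trace) != len(golden_trace):
        return min(len(dut_trace), len(golden_trace))
    return None
```

In this sketch an input that only revisits already-covered points earns a reward of zero, which is one simple way to steer a generator toward unexplored behavior; the actual ChatFuzz reward may be defined differently.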
