Differential testing can be an effective way to find bugs in software systems with multiple implementations that conform to the same specification, like compilers, network protocol parsers, and language runtimes. Specifications for such systems are often standardized in natural language documents, like Instruction Set Architecture (ISA) specifications, Wasm specifications, or IETF RFCs. Large Language Models (LLMs) have demonstrated potential in both generating tests and handling large volumes of natural language text, making them well-suited for utilizing artifacts like specification documents, bug reports, and code implementations. In this work, we leverage natural language and code artifacts to guide LLMs to generate targeted tests that highlight meaningful behavioral differences between implementations, including those corresponding to bugs. We introduce DiffSpec, a framework for generating differential tests with LLMs using prompt chaining. We demonstrate the efficacy of DiffSpec on two different systems, namely, eBPF runtimes and Wasm validators. Using DiffSpec, we generated 359 differentiating tests, uncovering at least four distinct and confirmed bugs in eBPF, including a kernel memory leak, inconsistent behavior in jump instructions, and undefined behavior when using the stack pointer. We also generated 279 differentiating tests for Wasm validators, which point to at least two confirmed and fixed bugs.
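To illustrate the core idea of a differentiating test, here is a minimal, hypothetical sketch (not DiffSpec itself): two implementations of the same informal specification, "integer division," that disagree on negative operands because one truncates toward zero (C-style) while the other rounds toward negative infinity (Python-style). A differential harness runs both on the same inputs and keeps any input where the outputs diverge.

```python
import random

def div_trunc(a: int, b: int) -> int:
    """C-style division: quotient truncated toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def div_floor(a: int, b: int) -> int:
    """Python-style division: quotient rounded toward negative infinity."""
    return a // b

def differential_test(impls, inputs):
    """Run every implementation on every input; return the inputs
    (with their outputs) on which the implementations disagree --
    these are the 'differentiating tests'."""
    diffs = []
    for args in inputs:
        outputs = {name: f(*args) for name, f in impls.items()}
        if len(set(outputs.values())) > 1:
            diffs.append((args, outputs))
    return diffs

# A couple of hand-picked edge cases plus random inputs.
random.seed(0)
inputs = [(-7, 2), (7, -2)] + [
    (random.randint(-100, 100), random.choice([-9, -2, 3, 7]))
    for _ in range(200)
]
diffs = differential_test({"trunc": div_trunc, "floor": div_floor}, inputs)
print(diffs[0])  # e.g. ((-7, 2), {'trunc': -3, 'floor': -4})
```

The harness only detects that the implementations disagree; deciding which one conforms to the specification (or whether the specification is ambiguous) still requires consulting the specification text, which is the gap DiffSpec uses LLMs to bridge.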