Greybox fuzzing, which conducts a biased random search over the program input space, is one of the most popular methods for detecting software vulnerabilities. To achieve deeper coverage of program behaviors, it is often combined with concolic execution, which performs a path-sensitive search over the domain of program inputs. In such hybrid fuzzing, greybox fuzzing and concolic execution alternate in an iterative loop, and the reachability roadblocks encountered by greybox fuzzing are tackled by concolic execution. However, hybrid fuzzing still suffers from the difficulties conventionally faced by concolic execution, such as the need for environment modeling and system call support. In this work, we explore the potential of "smart" concolic execution empowered by Large Language Models (LLMs), leveraging their knowledge of code semantics during constraint computation and solving. When coverage-based greybox fuzzing fails to reach certain branches, we slice the execution trace and use the LLM as a solver to generate a modified input that reaches those branches. Compared with the state-of-the-art hybrid fuzzers CoFuzz, Intriguer, and QSYM, our LLM-based hybrid fuzzer HyllFuzz (pronounced "hill fuzz") covers 31.43%, 44.56%, and 59.48% more code branches, respectively. Furthermore, the LLM-based concolic execution in HyllFuzz runs 3 to 19 times faster than the concolic execution in existing hybrid fuzzing tools. On extensively tested real-world subjects, HyllFuzz exposed seven previously unknown bugs. This experience shows that LLMs can be effectively inserted into the iterative loop of hybrid fuzzers to efficiently expose more program behaviors.
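The iterative loop described in the abstract can be illustrated with a minimal sketch. It is not HyllFuzz's implementation: the toy `target`, the hard-coded `sliced_conditions` table, and `stub_solver` (a placeholder where an LLM query would go) are all hypothetical names introduced here for illustration. The sketch only shows the control flow: greybox mutation runs first, and any branch it leaves uncovered is handed to a solver together with the conditions guarding it.

```python
import random
from typing import Callable, List, Set

# Hypothetical toy target: returns the set of branch IDs an input covers.
def target(data: bytes) -> Set[str]:
    covered = {"entry"}
    if len(data) >= 4:
        covered.add("len_ok")
        if data[:4] == b"HYLL":  # roadblock: unlikely under random mutation
            covered.add("magic_ok")
    return covered

ALL_BRANCHES = {"entry", "len_ok", "magic_ok"}

def sliced_conditions(wanted: str) -> List[str]:
    """Stand-in for trace slicing: the conditions that guard `wanted`.
    Assumption: a real tool derives these from an execution trace."""
    guards = {
        "len_ok": ["len(data) >= 4"],
        "magic_ok": ["len(data) >= 4", "data[:4] == b'HYLL'"],
    }
    return guards.get(wanted, [])

def stub_solver(seed: bytes, conditions: List[str]) -> bytes:
    """Placeholder for the LLM acting as a solver (assumption: a real
    system would send the seed and sliced conditions to a model)."""
    out = seed
    for cond in conditions:
        if "HYLL" in cond:
            out = b"HYLL" + out[4:]
        elif "len(data) >= 4" in cond and len(out) < 4:
            out = out.ljust(4, b"\0")
    return out

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one random byte and occasionally append one."""
    buf = bytearray(data or b"\0")
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    if rng.random() < 0.3:
        buf.append(rng.randrange(256))
    return bytes(buf)

def hybrid_fuzz(solver: Callable[[bytes, List[str]], bytes],
                rounds: int = 3, budget: int = 200, seed: int = 0) -> Set[str]:
    rng = random.Random(seed)
    corpus = [b"A"]
    covered = set(target(corpus[0]))
    for _ in range(rounds):
        # Phase 1: greybox fuzzing keeps inputs that add coverage.
        for _ in range(budget):
            cand = mutate(rng.choice(corpus), rng)
            new = target(cand) - covered
            if new:
                corpus.append(cand)
                covered |= new
        # Phase 2: hand each remaining roadblock to the solver.
        for branch in sorted(ALL_BRANCHES - covered):
            cand = solver(corpus[-1], sliced_conditions(branch))
            new = target(cand) - covered
            if new:
                corpus.append(cand)
                covered |= new
    return covered

if __name__ == "__main__":
    print(sorted(hybrid_fuzz(stub_solver)))
```

Passing the solver in as a callable keeps the loop independent of how candidate inputs are produced, so the stub could be swapped for a model-backed function without touching the fuzzing logic.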