Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities

Security vulnerabilities are increasingly prevalent in modern software, with wide-ranging consequences for society. Of the many defenses that have been proposed, those based on deep learning (DL) avoid major barriers faced by other techniques and have therefore attracted growing attention in recent years. However, DL-based approaches face critical challenges of their own, including the lack of sizable, high-quality labeled task-specific datasets and an inability to generalize well to unseen, real-world scenarios. Lately, large language models (LLMs) have demonstrated impressive potential in various domains by overcoming those challenges, especially through chain-of-thought (CoT) prompting. In this paper, we explore how to leverage LLMs and CoT to address three key software vulnerability analysis tasks: identifying vulnerabilities of a given type, discovering vulnerabilities of any type, and patching detected vulnerabilities. We instantiate the general CoT methodology in the context of these tasks through VSP, our unified, vulnerability-semantics-guided prompting approach, and conduct extensive experiments assessing VSP against five baselines on the three tasks, using three LLMs and two datasets. Results show that our CoT-inspired prompting substantially outperforms the baselines (553.3%, 36.5%, and 30.8% higher F1 accuracy for vulnerability identification, discovery, and patching, respectively, on CVE datasets). Through in-depth case studies analyzing VSP failures, we also reveal current gaps in LLM/CoT for challenging vulnerability cases, and we propose and validate corresponding improvements.
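
To make the three tasks concrete, the following is a minimal sketch of how a few-shot CoT prompt for each of them might be assembled. It is an illustration under stated assumptions, not the VSP prompts themselves: the instruction wording, the `Exemplar` type, the `build_cot_prompt` helper, and the sample exemplar are all hypothetical, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical instruction templates, one per task named in the abstract.
# The wording is an assumption for illustration, not the paper's prompts.
TASK_INSTRUCTIONS = {
    "identify": "Does the following code contain a vulnerability of type {cwe}?",
    "discover": "Does the following code contain any vulnerability? If so, name its type.",
    "patch": "The following code is vulnerable. Provide a patch that fixes it.",
}


@dataclass
class Exemplar:
    """One worked example: code, step-by-step reasoning, and final answer."""
    code: str
    reasoning: str
    answer: str


def build_cot_prompt(task: str, target_code: str,
                     exemplars: List[Exemplar],
                     cwe: Optional[str] = None) -> str:
    """Assemble a few-shot chain-of-thought prompt for one of the three tasks."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    if task == "identify" and cwe is None:
        raise ValueError("the 'identify' task needs a vulnerability type")

    # Only the instruction is formatted; code is appended verbatim so that
    # braces in C-like code are never interpreted as format fields.
    instruction = TASK_INSTRUCTIONS[task].format(cwe=cwe)

    parts = []
    for ex in exemplars:
        # Each exemplar shows the reasoning before the answer, which is
        # what distinguishes CoT prompting from plain few-shot prompting.
        parts.append(
            f"Q: {instruction}\n{ex.code}\n"
            f"A: {ex.reasoning}\nTherefore, {ex.answer}\n"
        )
    # The target question ends with an open answer slot for the model.
    parts.append(f"Q: {instruction}\n{target_code}\nA:")
    return "\n".join(parts)


# Hypothetical exemplar: an unchecked copy into a fixed-size buffer.
EXAMPLE = Exemplar(
    code="char buf[8];\nstrcpy(buf, input);",
    reasoning=("buf holds 8 bytes. strcpy copies input without checking its "
               "length, so a longer input writes past the end of buf."),
    answer="yes, this is CWE-787 (out-of-bounds write).",
)

prompt = build_cot_prompt(
    "identify",
    "int a[4];\na[idx] = 0;",
    [EXAMPLE],
    cwe="CWE-787",
)
```

The resulting `prompt` string would then be sent to whichever LLM is under evaluation. Keeping the reasoning in front of the answer in every exemplar is the design choice that matters here; the rest of the layout is arbitrary.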
