Program slicing is a critical technique in software engineering, enabling developers to isolate relevant portions of code for tasks such as bug detection, code comprehension, and debugging. In this study, we investigate the application of large language models (LLMs) to both static and dynamic program slicing, with a focus on Java programs. We evaluate the performance of four state-of-the-art LLMs, GPT-4o, GPT-3.5 Turbo, Llama-2, and Gemma-7B, leveraging advanced prompting techniques, including few-shot learning and chain-of-thought reasoning. Using a dataset of 100 Java programs derived from LeetCode problems, our experiments reveal that GPT-4o outperforms the other LLMs in both static and dynamic slicing, achieving accuracies of 60.84% and 59.69%, respectively. Our results also show that none of the LLMs we experimented with has yet achieved reasonable performance on either static or dynamic slicing. Through a rigorous manual analysis, we developed a taxonomy of root causes and failure locations to explore the unsuccessful cases in more depth. We identified Complex Control Flow as the most frequent root cause of failures, with the majority of issues occurring at Variable Declarations and Assignments. To improve the performance of LLMs, we further examined two independent prompting strategies guided by our taxonomy: prompt crafting, which refines the prompts to better guide the LLM through the slicing process, and iterative prompting, in which the model receives feedback on the root cause and location of a failure and re-generates its response. Our evaluation shows that these two prompting enhancements improve accuracy by 4% and 3.9%, respectively.
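To make the slicing task concrete, the following is a minimal, hypothetical Java example (not drawn from the paper's dataset) of a backward static slice: given a slicing criterion (here, the value of `sum` at the return statement), the slice keeps only the statements `sum` depends on and drops the unrelated `product` computation.

```java
public class SliceDemo {
    // Full program: computes both the sum and the product of 1..n.
    static int original(int n) {
        int sum = 0;
        int product = 1;
        for (int i = 1; i <= n; i++) {
            sum += i;
            product *= i; // no data/control dependence from `sum`; excluded from the slice
        }
        return sum; // slicing criterion: value of `sum` here
    }

    // Backward static slice w.r.t. the criterion above: only statements
    // on which `sum` is data- or control-dependent remain.
    static int sliced(int n) {
        int sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A correct slice preserves the behavior observable at the criterion.
        System.out.println(original(5) == sliced(5)); // prints "true"
    }
}
```

A dynamic slice would further restrict this to the statements actually executed for one concrete input, which is why the two settings are evaluated separately in the study.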