Large Language Models (LLMs) have shown promise in multiple software engineering tasks, including code generation, code summarisation, test generation, and code repair. Fault localisation is essential for automatic program debugging and repair, and was showcased as a highlight at ChatGPT-4's launch event. Nevertheless, little work has examined LLMs' capabilities for fault localisation in large-scale open-source programs. To fill this gap, this paper presents an in-depth investigation into the fault localisation capability of ChatGPT-3.5 and ChatGPT-4, two state-of-the-art LLMs. Using the widely adopted Defects4J dataset, we compare the two LLMs with existing fault localisation techniques. We also investigate the stability and explainability of LLMs in fault localisation, as well as how prompt engineering and the length of the code context affect fault localisation effectiveness. Our findings demonstrate that, within a limited code context, ChatGPT-4 outperforms all the existing fault localisation methods. Supplying additional error logs further improves the ChatGPT models' localisation accuracy and stability, yielding on average 46.9% higher accuracy than the state-of-the-art baseline SmartFL in terms of the TOP-1 metric. However, performance declines dramatically when the code context expands to the class level, where the ChatGPT models become less effective overall than the existing methods. Additionally, we observe that ChatGPT's explainability is unsatisfactory, with an accuracy rate of only approximately 30%. These observations indicate that while ChatGPT can achieve effective fault localisation under certain conditions, it has evident limitations. Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.