FlexFL: Flexible and Effective Fault Localization with Open-Source Large Language Models

Due to the impressive code comprehension ability of Large Language Models (LLMs), a few studies have proposed leveraging LLMs to locate bugs, i.e., LLM-based fault localization (FL), and have demonstrated promising performance. However, these methods have two limitations. First, they are limited in flexibility: they rely on bug-triggering test cases to perform FL and cannot make use of other available bug-related information, e.g., bug reports. Second, they are built upon proprietary LLMs, which, although powerful, carry data privacy risks. To address these limitations, we propose a novel LLM-based FL framework named FlexFL, which can flexibly leverage different types of bug-related information and work effectively with open-source LLMs. FlexFL is composed of two stages. In the first stage, FlexFL reduces the search space of buggy code using state-of-the-art FL techniques of different families and provides a candidate list of bug-related methods. In the second stage, FlexFL leverages LLMs to double-check the code snippets of the methods suggested by the first stage and refine the fault localization results. In each stage, FlexFL constructs agents based on open-source LLMs; these agents share the same pipeline, which does not presuppose any particular type of bug-related information and can interact through function calls even when the underlying LLM lacks out-of-the-box function-calling capability. Extensive experimental results on Defects4J demonstrate that FlexFL outperforms the baselines and can work with different open-source LLMs. Specifically, FlexFL with the lightweight open-source LLM Llama3-8B can locate 42 and 63 more bugs than two state-of-the-art LLM-based FL approaches, AutoFL and AgentFL, respectively, both of which use GPT-3.5.
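To illustrate how an agent might interact through function calls when the underlying LLM has no native function-calling support, the following is a minimal sketch and not the FlexFL implementation. It assumes the model writes calls as plain text in the form `name(arg)`; this call syntax, the tool name `get_code_snippet`, the helpers `parse_call` and `dispatch`, and the toy `SOURCES` table are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical call syntax: name(arg1, arg2). The real prompt format may differ.
CALL_PATTERN = re.compile(r"([A-Za-z_]\w*)\(([^()]*)\)")

Tool = Callable[..., str]


def parse_call(reply: str, tools: Dict[str, Tool]) -> Optional[Tuple[str, List[str]]]:
    """Return the first call to a registered tool found in free-form model text."""
    for match in CALL_PATTERN.finditer(reply):
        name, raw_args = match.group(1), match.group(2)
        if name in tools:
            # Split on commas and drop surrounding whitespace and quotes.
            args = [a.strip().strip("'\"") for a in raw_args.split(",") if a.strip()]
            return name, args
    return None


def dispatch(reply: str, tools: Dict[str, Tool]) -> str:
    """Execute the parsed call, or return a corrective message for the model."""
    call = parse_call(reply, tools)
    if call is None:
        return "No valid function call found; please call one of: " + ", ".join(sorted(tools))
    name, args = call
    try:
        return tools[name](*args)
    except TypeError as exc:  # e.g. wrong number of arguments
        return f"Invalid arguments for {name}: {exc}"


# Toy stand-in for a code base (hypothetical data, for illustration only).
SOURCES = {"Calculator.divide": "int divide(int a, int b) { return a / b; }"}


def get_code_snippet(method: str) -> str:
    """Hypothetical tool: look up the source of a method by its name."""
    return SOURCES.get(method, f"unknown method: {method}")


TOOLS: Dict[str, Tool] = {"get_code_snippet": get_code_snippet}

if __name__ == "__main__":
    print(dispatch("I should inspect it: get_code_snippet(Calculator.divide)", TOOLS))
```

In this sketch, a malformed or missing call yields a plain-text message that can be fed back to the model instead of raising an error, so the conversation can continue with any model that produces text.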
