With the growing interest in Large Language Models (LLMs) for fault localization and program repair, ensuring the integrity and generalizability of LLM-based methods has become paramount. The code in existing widely adopted benchmarks for these tasks was written before the rise of LLMs and may be included in the training data of popular LLMs. These benchmarks therefore suffer from the threat of data leakage, which leads to misleadingly optimistic performance metrics. To address this issue, we introduce "ConDefects", a novel dataset of real faults meticulously curated to eliminate such overlap. ConDefects contains 1,254 faulty Java programs and 1,625 faulty Python programs. All of these programs are sourced from the online competition platform AtCoder and were produced between October 2021 and September 2023. We pair each fault with its fault locations and the corresponding repaired code version, making the dataset well suited to research on fault localization and program repair. We also provide interfaces for selecting subsets based on different time windows and coding task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted for benchmarking all types of fault localization and program repair methods. The dataset is publicly available, and a demo video can be found at https://www.youtube.com/watch?v=22j15Hj5ONk.
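The subset selection by time window and task difficulty mentioned above can be illustrated with a minimal sketch. This is not the ConDefects interface: the `FaultRecord` fields, the `select_subset` function, and the sample records are all hypothetical and exist only to show how restricting a benchmark to faults produced after a model's training cutoff might look.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical record layout; the real dataset schema may differ."""
    task_id: str
    language: str      # "Java" or "Python"
    submitted: date    # when the faulty program was produced
    difficulty: int    # assumed numeric difficulty score of the task


def select_subset(
    records: List[FaultRecord],
    start: date,
    end: date,
    language: Optional[str] = None,
    max_difficulty: Optional[int] = None,
) -> List[FaultRecord]:
    """Keep records inside [start, end], optionally filtered further."""
    return [
        r
        for r in records
        if start <= r.submitted <= end
        and (language is None or r.language == language)
        and (max_difficulty is None or r.difficulty <= max_difficulty)
    ]


# Made-up records for illustration only; they are not dataset entries.
records = [
    FaultRecord("task_a", "Java", date(2021, 11, 6), 300),
    FaultRecord("task_b", "Python", date(2022, 8, 20), 1200),
    FaultRecord("task_c", "Python", date(2023, 5, 13), 500),
    FaultRecord("task_d", "Java", date(2023, 9, 2), 1800),
]

# Faults produced in 2023 on easier tasks, e.g. to evaluate a model
# whose training data is assumed to end before 2023.
recent_easy = select_subset(
    records, date(2023, 1, 1), date(2023, 9, 30), max_difficulty=1000
)
print([r.task_id for r in recent_easy])
```

Choosing the start of the window to fall after a model's training cutoff is what keeps the evaluated faults out of that model's training data; the difficulty filter is independent of that and only controls how hard the selected tasks are.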