Performance bugs are a persistent challenge in software development: they degrade performance and waste computational resources, and developers invest substantial effort in fixing them. Curating these bugs can give the software engineering research community valuable insight and aid the development of new mitigation strategies. However, no large-scale open-source dataset of performance bugs is currently available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M-parameter BERT model trained to classify performance bug-related commits. Our evaluation shows that PcBERT-KD achieves accuracy comparable to 7-billion-parameter LLMs at significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Using PcBERT-KD as its core component, we deployed PerfCurator on a 50-node CPU cluster to mine GitHub repositories. This mining operation produced a large-scale dataset comprising 114K performance bug-fix commits in Python, 217.9K in C++, and 76.6K in Java. Our results demonstrate that this dataset significantly enhances the effectiveness of data-driven performance bug detection systems.