Bug reports provide critical insights into software quality, yet existing datasets often suffer from limited scope, outdated content, or insufficient metadata for machine learning. To address these limitations, we present GitBugs, a comprehensive and up-to-date dataset of more than 150,000 bug reports from nine actively maintained open-source projects, including Firefox, Cassandra, and VS Code. GitBugs aggregates data from the GitHub, Bugzilla, and Jira issue trackers, and it offers standardized categorical fields for classification tasks and predefined train/test splits for duplicate bug detection. It also includes exploratory analysis notebooks and detailed project-level statistics, such as duplicate rates and resolution times. GitBugs supports a range of software engineering research tasks, including duplicate detection, retrieval-augmented generation, resolution prediction, automated triaging, and temporal analysis. The openly licensed dataset provides a cross-project resource for benchmarking and advancing automated bug report analysis. The data and code are available at https://github.com/av9ash/gitbugs/.
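As a rough illustration of the duplicate detection task mentioned above, the following sketch ranks existing reports by TF-IDF cosine similarity to a new report. It is not the method used by GitBugs, and it does not read the dataset: the example reports are invented placeholders, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_candidates`) are hypothetical. A real experiment would load the report text from the dataset's own files and evaluate on its predefined train/test splits.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF term."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, corpus):
    """Return corpus indices ordered from most to least similar to query."""
    vectors = tfidf_vectors(corpus + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(vectors[:-1])]
    # Sort by descending score; break ties by original index.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Invented placeholder reports, NOT taken from GitBugs.
corpus = [
    "Crash when opening settings page on startup",
    "Dark theme colors are wrong in the editor",
    "Search results are not sorted by date",
]
query = "Application crashes on startup when settings page opens"

ranking = rank_candidates(query, corpus)
print(ranking[0])  # index of the most similar existing report
```

A lexical baseline like this is only a starting point for comparison; it matches exact tokens, so "crash" and "crashes" do not count as the same term unless stemming or an embedding model is added.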