Mining software repositories is a useful technique for researchers and practitioners to see what software developers actually do when developing software. Tools like Boa let users easily mine open-source software repositories at a very large scale, with datasets containing hundreds of thousands of projects. The trade-off is that users must use the provided infrastructure, query language, runtime, and datasets, which might not fit all analysis needs. In this work, we present Boidae: a family of Boa installations controlled and customized by users. Boidae uses automation tools such as Ansible and Docker to facilitate the deployment of a customized Boa installation. In particular, Boidae allows the creation of custom datasets generated from any set of Git repositories, with helper scripts to aid in finding and cloning repositories from GitHub and SourceForge. In this paper, we briefly describe the architecture of Boidae and how researchers can use the infrastructure to generate custom datasets. Boidae's scripts and all the infrastructure it builds upon are open source. A video demonstration of Boidae's installation and extension is available at https://go.unl.edu/boidae.