The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts have led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions, that enable faster integration and testing for ML projects. Such CI/CD configurations and services must be kept synchronized throughout the life cycle of the projects. Several works have discussed how CI/CD configurations and services change during their usage in traditional software systems. However, there is very limited knowledge of how they change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects and devised a taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a CI/CD configuration change clustering tool that identified frequent CI/CD configuration change patterns in 15,634 commits. Furthermore, we measured the expertise of ML developers who modify CI/CD configurations. Based on this analysis, we found that 61.8% of commits include a change to the build policy, and that changes related to performance and maintainability are minimal compared to general open-source projects. Additionally, the co-evolution analysis showed that CI/CD configurations were, in many cases, changed unnecessarily because of bad practices such as the direct inclusion of dependencies and the lack of standardized testing frameworks. The change pattern analysis revealed further bad practices, including the use of deprecated settings and reliance on a generic build language. Finally, our developer expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.