The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts have led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions, that enable faster integration and testing for ML projects. Such CI/CD configurations and services must be kept synchronized with the project throughout its life cycle. Several works have discussed how CI/CD configurations and services change during their use in traditional software systems. However, there is very limited knowledge of how they change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects and devised a taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a CI/CD configuration change clustering tool that identified frequent CI/CD configuration change patterns in 15,634 commits. Furthermore, we measured the expertise of ML developers who modify CI/CD configurations. Based on this analysis, we found that 61.8% of commits include a change to the build policy, whereas changes related to performance and maintainability are minimal compared to general open-source projects. Additionally, the co-evolution analysis showed that CI/CD configurations, in many cases, changed unnecessarily due to bad practices such as the direct inclusion of dependencies and the lack of standardized testing frameworks. The change pattern analysis revealed further bad practices, such as using deprecated settings and relying on a generic build language. Finally, our developer expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.