From Code Changes to Quality Gains: An Empirical Study in Python ML Systems with PyQu

In an era shaped by generative artificial intelligence for code generation and the rising adoption of Python-based machine learning systems (MLS), software quality has emerged as a major concern. As these systems grow in complexity and importance, a key obstacle lies in understanding exactly how specific code changes affect overall quality, a shortfall aggravated by the lack of quality assessment tools and of a clear mapping between MLS code changes and their quality effects. Although prior work has explored code changes in MLS, it mostly stops at what the changes are, leaving a gap in our knowledge of the relationship between code changes and MLS quality. To address this gap, we conducted a large-scale empirical study of 3,340 open-source Python ML projects, encompassing more than 3.7 million commits and 2.7 trillion lines of code. We introduce PyQu, a novel tool that leverages low-level software metrics to identify quality-enhancing commits, with an average accuracy, precision, and recall of 0.84 and an average F1 score of 0.85. Using PyQu and a thematic analysis, we identified 61 code changes, each demonstrating a direct impact on enhancing software quality, and we classified them into 13 categories based on contextual characteristics. Of these changes, 41% are newly discovered by our study and have not been identified by state-of-the-art Python change detection tools. Our work offers a vital foundation for researchers, practitioners, educators, and tool developers, advancing the quest for automated quality assessment and best practices in Python-based ML software.
