An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software

The emergence of open-source ML libraries such as TensorFlow and Google Auto ML has enabled developers to harness state-of-the-art ML algorithms with minimal overhead. However, during this accelerated development process, developers may make sub-optimal design and implementation decisions, introducing technical debt that, if not addressed promptly, can significantly affect the quality of ML-based software. Developers frequently acknowledge these sub-optimal choices in code comments. Such comments, which often highlight areas requiring future work or refinement, are known as self-admitted technical debt (SATD). This paper investigates SATD in ML code by analyzing 318 open-source ML projects across five domains, along with 318 non-ML projects. We detected SATD in source code comments across project snapshots, manually analyzed a sample of the identified SATD to understand the nature of technical debt in ML code, and performed a survival analysis to understand how such debt evolves. We observed that: i) ML projects have a median percentage of SATD that is twice that of non-ML projects; ii) ML pipeline components for data preprocessing and model generation logic are more susceptible to debt than model validation and deployment components; iii) SATD appears earlier in the development process in ML projects than in non-ML projects; and iv) long-lasting SATD is typically introduced during extensive code changes that span multiple files exhibiting low complexity.
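To make the idea of detecting SATD in source code comments concrete, the following is a minimal sketch that flags comments containing common debt markers. It is only an illustration: the keyword list and the function name `find_satd_comments` are assumptions made here, and this simple heuristic is not the detection approach used in the study.

```python
import io
import re
import tokenize

# Hypothetical marker list, chosen for illustration only.
# It is NOT the detector used in the study.
SATD_PATTERN = re.compile(
    r"\b(todo|fixme|hack|xxx|workaround|temporary)\b", re.IGNORECASE
)


def find_satd_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) for comments with a debt marker."""
    hits = []
    # tokenize separates real comments from '#' characters inside strings.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and SATD_PATTERN.search(tok.string):
            hits.append((tok.start[0], tok.string.lstrip("#").strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "# TODO: normalize features before training\n"
        "x = load()  # load raw data\n"
        "model = fit(x)  # HACK: hard-coded learning rate\n"
    )
    for line_no, text in find_satd_comments(sample):
        print(line_no, text)
```

Running the detector on each project snapshot and tracking when a flagged comment first appears and when it disappears would yield the lifetimes that a survival analysis operates on.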
