Deep Learning (DL) frameworks play a critical role in advancing artificial intelligence, and their rapid growth underscores the need for a comprehensive understanding of their software quality and maintainability. Like other software systems, DL frameworks are prone to code clones: identical or highly similar source code fragments within the same project or even across different projects. Code cloning can have both positive and negative implications for software development, influencing maintenance, readability, and bug propagation. In this paper, we address the knowledge gap concerning the evolution of code clones in DL frameworks and the extent of code reuse across these frameworks. We empirically analyze code clones in nine popular DL frameworks, i.e., TensorFlow, Paddle, PyTorch, Aesara, Ray, MXNet, Keras, Jax, and BentoML, to investigate (1) the characteristics of long-term code clone evolution over releases in each framework, (2) the short-term, i.e., within-release, code cloning patterns and their influence on the long-term trends, and (3) file-level code clones within the DL frameworks. Our findings reveal that DL frameworks follow four distinct cloning trends and that these trends share some common characteristics while differing in others. For instance, bug-fixing activities persistently occur in clones regardless of the clone evolutionary trend, but are more frequent in the "Serpentine" trend. Moreover, the within-release investigation demonstrates that short-term code cloning practices shape long-term cloning trends. Finally, the cross-framework investigation reveals the presence of functional and architectural-adaptation file-level code clones across the nine studied frameworks. We provide insights that foster robust clone practices and collaborative maintenance in the development of DL frameworks.