Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key to gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches depend either on manual labeling, which does not scale, or on ML classifiers that do not adequately support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions.
Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations and, in turn, how they can advance our understanding of data science practices.
Method: We conduct a confirmatory study based on two reference works, selected for their relevance to the limitations of the current state of the art. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against each reference study using two separate McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis using Pearson's chi-squared tests compares our insights on data science practices with those reported in prior studies.
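For readers unfamiliar with the tests named in the Method section, the sketch below shows how this family of comparisons can be run in Python with statsmodels and scipy. All data here are hypothetical (randomly generated 0/1 labels standing in for per-snippet classification outcomes); this is an illustrative sketch, not the study's actual analysis code.

```python
# Illustrative sketch of the statistical tests named in the abstract.
# Data are hypothetical: 1 if a model labeled a code snippet's pipeline
# stage correctly, 0 otherwise; rows = snippets, columns = candidate SLMs.
import numpy as np
from scipy.stats import chisquare
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
slm_results = rng.integers(0, 2, size=(200, 4))  # 200 snippets, 4 SLMs

# Cochran's Q: do the k SLMs differ in accuracy on the same snippets?
q = cochrans_q(slm_results, return_object=True)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

# McNemar's test: best SLM vs. a reference approach on paired outcomes,
# built from the 2x2 table of agreements/disagreements per snippet.
best = slm_results[:, 0]
reference = rng.integers(0, 2, size=200)  # hypothetical reference labels
table = np.array([
    [np.sum((best == 1) & (reference == 1)), np.sum((best == 1) & (reference == 0))],
    [np.sum((best == 0) & (reference == 1)), np.sum((best == 0) & (reference == 0))],
])
m = mcnemar(table, exact=True)
print(f"McNemar p = {m.pvalue:.3f}")

# Pearson's chi-squared goodness-of-fit: observed distribution of pipeline
# stages vs. the distribution reported by a prior study (counts must match
# in total; both vectors are hypothetical here).
observed = np.array([120, 80, 60, 40])
expected = np.array([110, 90, 55, 45])
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-squared = {stat:.2f}, p = {p:.3f}")
```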