Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key to gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches depend either on manual labeling, which does not scale, or on ML classifiers that do not adequately support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions.
Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations and, in turn, how they can advance our understanding of data science practices.
Method: We conduct a confirmatory study based on two reference works, selected for their relevance to the limitations of the current state of the art. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against each reference study using two separate McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis using Pearson's chi-squared tests compares our insights on data science practices with those reported in prior studies.
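For readers unfamiliar with the tests named in the Method section, the sketch below shows how this family of comparisons can be run in Python with statsmodels and scipy. All data here are hypothetical (randomly generated 0/1 labels standing in for per-snippet classification outcomes); this is an illustrative sketch, not the study's actual analysis code.

```python
# Illustrative sketch of the statistical tests named in the abstract.
# Data are hypothetical: 1 if a model labeled a code snippet's pipeline
# stage correctly, 0 otherwise; rows = snippets, columns = candidate SLMs.
import numpy as np
from scipy.stats import chisquare
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
slm_results = rng.integers(0, 2, size=(200, 4))  # 200 snippets, 4 SLMs

# Cochran's Q: do the k SLMs differ in accuracy on the same snippets?
q = cochrans_q(slm_results, return_object=True)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

# McNemar's test: best SLM vs. a reference approach on paired outcomes,
# built from the 2x2 table of agreements/disagreements per snippet.
best = slm_results[:, 0]
reference = rng.integers(0, 2, size=200)  # hypothetical reference labels
table = np.array([
    [np.sum((best == 1) & (reference == 1)), np.sum((best == 1) & (reference == 0))],
    [np.sum((best == 0) & (reference == 1)), np.sum((best == 0) & (reference == 0))],
])
m = mcnemar(table, exact=True)
print(f"McNemar p = {m.pvalue:.3f}")

# Pearson's chi-squared goodness-of-fit: observed distribution of pipeline
# stages vs. the distribution reported by a prior study (counts must match
# in total; both vectors are hypothetical here).
observed = np.array([120, 80, 60, 40])
expected = np.array([110, 90, 55, 45])
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-squared = {stat:.2f}, p = {p:.3f}")
```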