An Alternative to Cells for Selective Execution of Data Science Pipelines


Cells · Data Science · Notebooks · Tabular Data · Code

April 7, 2023


Lars Reimann, Günter Kniesel-Wünsche

From arXiv. Accepted for the NIER Track of the 45th International Conference on Software Engineering (ICSE 2023).

Data scientists often use notebooks to develop data science (DS) pipelines, particularly because notebooks allow parts of a pipeline to be executed selectively. However, notebooks for DS have many well-known flaws. We focus on the following ones in this paper: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g., listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After making changes to a cell, this cell and all cells that depend on the changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed. To solve these issues, we propose to replace cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions fitting their type (such as listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure that dependencies between variables are respected and results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of making errors.
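
The idea of running an action on a variable based on a data-flow analysis can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`defs_and_uses`, `backward_slice`, `affected_by_change`, `run_action`) and the toy `pipeline` are assumptions made for illustration, and the analysis only handles straight-line assignments to plain names (no control flow, no in-place mutation through method calls).

```python
import ast


def defs_and_uses(stmt_src):
    """Return (names assigned, names read) for one statement."""
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(stmt_src)):
        if isinstance(node, ast.Name):
            (defs if isinstance(node.ctx, ast.Store) else uses).add(node.id)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            uses.add(node.target.id)  # `x += 1` also reads x
    return defs, uses


def backward_slice(statements, target):
    """Indices of the statements needed to compute `target`, in program order."""
    needed, selected = {target}, []
    for i in range(len(statements) - 1, -1, -1):
        defs, uses = defs_and_uses(statements[i])
        if defs & needed:
            selected.append(i)
            needed = (needed - defs) | uses
    return sorted(selected)


def affected_by_change(statements, changed):
    """Indices that must be rerun after editing statement `changed`."""
    dirty = set(defs_and_uses(statements[changed])[0])
    rerun = [changed]
    for i in range(changed + 1, len(statements)):
        defs, uses = defs_and_uses(statements[i])
        if uses & dirty:
            rerun.append(i)
            dirty |= defs
        else:
            dirty -= defs  # redefined from unaffected inputs
    return rerun


def run_action(statements, variable, action):
    """Execute only the slice `variable` depends on, then apply `action` to it."""
    env = {}
    for i in backward_slice(statements, variable):
        exec(statements[i], env)
    return action(env[variable])


# Hypothetical toy pipeline; each string stands for one pipeline statement.
pipeline = [
    "raw = [3, 1, 2]",
    "scale = 10",
    "cleaned = sorted(raw)",
    "scaled = cleaned[0] * scale",
    "label = 'demo'",
]

print(backward_slice(pipeline, "cleaned"))   # [0, 2]
print(affected_by_change(pipeline, 1))       # [1, 3]
print(run_action(pipeline, "cleaned", len))  # 3
```

Here `run_action(pipeline, "cleaned", len)` plays the role of a context-menu action: it executes only statements 0 and 2, so the decision-making call (`len`) never has to appear in the pipeline itself. Likewise, editing `scale` marks only statements 1 and 3 for re-execution, leaving the unaffected `cleaned` computation alone, which is the behavior that cell-level granularity cannot guarantee.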

