Data scientists often use notebooks to develop data science (DS) pipelines, particularly because notebooks allow parts of a pipeline to be executed selectively. However, notebooks for DS have many well-known flaws. In this paper, we focus on the following: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g., listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After a cell is changed, that cell and all cells that depend on the changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Because cells are the smallest unit of execution, code that is unaffected by a change can inadvertently be re-executed. To solve these issues, we propose replacing cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for each variable with actions that fit its type (such as listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis, which ensures that dependencies between variables are respected and that results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of errors.
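To make the role of the data-flow analysis concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes a straight-line pipeline of top-level assignments with no control flow, no in-place mutation through method calls, and no imports. The pipeline text and the names `reads_and_writes`, `statements_needed_for`, and `stale_after_change` are hypothetical and chosen for illustration only. The first function answers which statements must run to compute one variable (issue 2); the second answers which statements must be rerun after one statement changes (issues 3 and 5).

```python
import ast

# Hypothetical pipeline; it is only parsed, never executed.
PIPELINE = """\
raw = load("data.csv")
clean = drop_missing(raw)
labels = extract_labels(raw)
features = scale(clean)
model = fit(features, labels)
report = summarize(clean)
"""


def reads_and_writes(stmt):
    """Return (names read, names written) by one top-level statement."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
        # `x += 1` both reads and writes x, but its target only has Store context.
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            reads.add(node.target.id)
    return reads, writes


def statements_needed_for(source, variable):
    """Indices of the statements that must run to compute `variable`."""
    info = [reads_and_writes(s) for s in ast.parse(source).body]
    wanted = {variable}
    needed = []
    # Walk backwards: a statement is needed if it defines a wanted name.
    for i in range(len(info) - 1, -1, -1):
        reads, writes = info[i]
        if writes & wanted:
            needed.append(i)
            wanted = (wanted - writes) | reads
    return sorted(needed)


def stale_after_change(source, changed):
    """Indices of the statements to rerun after statement `changed` was edited."""
    info = [reads_and_writes(s) for s in ast.parse(source).body]
    stale = [changed]
    dirty = set(info[changed][1])
    # Walk forwards: a statement is stale if it reads a dirty name.
    for i in range(changed + 1, len(info)):
        reads, writes = info[i]
        if reads & dirty:
            stale.append(i)
            dirty |= writes
        else:
            # Names redefined from unaffected inputs become clean again.
            dirty -= writes
    return stale


print(statements_needed_for(PIPELINE, "report"))  # [0, 1, 5]
print(stale_after_change(PIPELINE, 2))            # [2, 4]
```

In this sketch, requesting `report` skips the statements that compute `labels`, `features`, and `model`, and editing the `labels` statement marks only that statement and the `model` statement for re-execution. A cell-based notebook that grouped these statements into one cell would rerun all of them.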