In software engineering, numerous studies have analyzed fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for computational notebooks in the context of data science. To help bridge this research gap, we make three scientific contributions: we (1) introduce a toolset for collecting code changes in Jupyter notebooks during development; (2) use it to collect more than 100 hours of work on a data analysis task and a machine learning task (carried out by 20 developers with different levels of expertise), resulting in a dataset containing 2,655 cells and 9,207 cell executions; and (3) use this dataset to investigate the dynamic nature of the notebook development process and the changes that take place in the notebooks. In our analysis of the collected data, we classified the changes made to cells between consecutive executions and found that a significant number of these changes were code iteration modifications. We report a number of other insights and propose potential future research directions that this novel data makes possible.
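To make the idea of "changes made to cells between executions" concrete, the following is a minimal sketch and not the toolset described above. It assumes a hypothetical log format (the `Execution` record and its fields are invented for illustration) and uses Python's standard `difflib` to compare each execution of a cell with the previous execution of the same cell. The three labels and the 0.5 threshold in `label_change` are illustrative assumptions, not the classification used in the study.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Execution:
    """One logged cell execution (hypothetical record layout, for illustration only)."""
    cell_id: str
    order: int    # position of this execution in the session log
    source: str   # cell source at the moment it was run


def changes_between_executions(log):
    """Yield (cell_id, similarity, diff_lines) for consecutive runs of the same cell."""
    last_source = {}
    for ex in sorted(log, key=lambda e: e.order):
        prev = last_source.get(ex.cell_id)
        if prev is not None:
            # Character-level similarity in [0, 1]; 1.0 means identical source.
            ratio = difflib.SequenceMatcher(None, prev, ex.source).ratio()
            diff = list(difflib.unified_diff(
                prev.splitlines(), ex.source.splitlines(), lineterm="", n=0))
            yield ex.cell_id, ratio, diff
        last_source[ex.cell_id] = ex.source


def label_change(ratio):
    """Toy three-way label; the threshold is an assumption, not taken from the study."""
    if ratio == 1.0:
        return "re-execution"
    if ratio >= 0.5:
        return "small edit"
    return "rewrite"


# Tiny made-up log: cell c1 is edited slightly, cell c2 is re-run unchanged.
log = [
    Execution("c1", 1, "df = load()\ndf.head()"),
    Execution("c2", 2, "model.fit(X, y)"),
    Execution("c1", 3, "df = load()\ndf.head(10)"),
    Execution("c2", 4, "model.fit(X, y)"),
]

changes = list(changes_between_executions(log))
results = [(cell_id, label_change(ratio)) for cell_id, ratio, _ in changes]
```

On this made-up log, `results` pairs `c1` with a small edit and `c2` with an unchanged re-execution. A real analysis would need labels grounded in the collected data rather than a single similarity threshold.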