In software engineering, numerous studies have analyzed fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for computational notebooks in the context of data science. To help bridge this research gap, we make three scientific contributions: we (1) introduce a toolset for collecting code changes in Jupyter notebooks during development; (2) use it to collect more than 100 hours of work on a data analysis task and a machine learning task, carried out by 20 developers with different levels of expertise, resulting in a dataset of 2,655 cells and 9,207 cell executions; and (3) use this dataset to investigate the dynamic nature of the notebook development process and the changes that take place in the notebooks. In our analysis of the collected data, we classified the changes made to cells between executions and found that a significant number of these changes were relatively small fixes and iterative code modifications. This suggests that notebooks are used not only as a development and exploration tool but also as a debugging tool. We report a number of other insights and propose potential future research directions based on the novel data.
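To make the idea of classifying changes between executions concrete, the following is a minimal sketch, not the toolset or taxonomy used in this work. It assumes each execution of a cell is logged as a plain source string and compares consecutive snapshots with Python's standard `difflib`. The function name `classify_change`, the label names, and the 0.8 similarity threshold are all hypothetical choices made for illustration.

```python
import difflib


def classify_change(before: str, after: str, small_threshold: float = 0.8) -> str:
    """Label the change between two consecutive executions of one cell.

    The labels and the threshold are illustrative assumptions,
    not the categories used in the study.
    """
    if before == after:
        return "re-execution"  # same source was run again
    if before.strip() == "":
        return "new code"  # cell was empty before this execution
    # Character-level similarity in [0, 1]; 1.0 means identical strings.
    ratio = difflib.SequenceMatcher(None, before, after).ratio()
    return "small edit" if ratio >= small_threshold else "large rewrite"


# Hypothetical execution history of a single cell (source at each run).
executions = [
    "df = pd.read_csv('data.csv')",
    "df = pd.read_csv('data.csv')",
    "df = pd.read_csv('data.csv', sep=';')",
    "model = RandomForestClassifier().fit(X, y)",
]

# Compare each execution with the one that preceded it.
labels = [classify_change(a, b) for a, b in zip(executions, executions[1:])]
print(labels)
```

A character-level similarity ratio is only one possible heuristic. A closer analysis would likely need to look at the structure of the code and at execution outcomes such as errors, which this sketch ignores.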