Jupyter Notebook is an interactive development environment commonly used for rapid experimentation with machine learning (ML) solutions. Annotating code cells with the ML activity they perform improves the readability and comprehensibility of notebooks, but manual annotation is time-consuming and error-prone. Tools have therefore been developed that classify the cells of a notebook according to the ML activity performed in them. However, current tools are inflexible: they rely on predefined lookup tables that map function calls of commonly used ML libraries to ML activities, and these tables must be adjusted manually whenever libraries are added or changed. This paper presents a more flexible, hybrid approach to cell classification that combines a rule-based classifier with a decision tree classifier. We discuss the design rationale and describe the developed classifiers in detail. We implemented the approach in a tool called JupyLabel and report its evaluation in terms of precision, recall, and F1-score. Additionally, we compared JupyLabel with HeaderGen, an existing cell classification tool, and show that the presented approach significantly outperforms it.
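To make the idea of a hybrid cell classifier more concrete, the following is a minimal sketch and not JupyLabel's actual implementation. The abstract does not specify the features, the activity labels, or how the two classifiers are combined, so everything here is an assumption chosen for illustration: the names `RULES`, `TRAINING`, `call_names`, `build_tree`, and `classify_cell` are hypothetical, the features are simply the names of called functions, the tree is a small hand-rolled Gini-based decision tree, and the combination strategy is "apply the rules first, fall back to the tree".

```python
import ast
from collections import Counter

# Hypothetical rule table: called name -> ML activity (illustrative labels only).
RULES = {"read_csv": "data_loading", "fit": "model_training"}

# Hypothetical labelled cells, each reduced to the set of names it calls.
TRAINING = [
    ({"read_csv", "head"}, "data_loading"),
    ({"dropna", "fillna"}, "data_preparation"),
    ({"plot", "show"}, "visualization"),
    ({"hist", "show"}, "visualization"),
]


def call_names(source):
    """Return the set of function and method names called in a cell."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)
            elif isinstance(func, ast.Name):
                names.add(func.id)
    return names


def gini(labels):
    """Gini impurity of a list of labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def build_tree(samples):
    """Build a binary tree: a leaf is a label, a node is (feature, yes, no)."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    for feat in sorted(set().union(*(feats for feats, _ in samples))):
        yes = [s for s in samples if feat in s[0]]
        no = [s for s in samples if feat not in s[0]]
        if not yes or not no:
            continue  # this feature does not split the samples
        score = (len(yes) * gini([l for _, l in yes])
                 + len(no) * gini([l for _, l in no])) / len(samples)
        if best is None or score < best[0]:
            best = (score, feat, yes, no)
    if best is None:  # no useful split left: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    _, feat, yes, no = best
    return (feat, build_tree(yes), build_tree(no))


def predict_tree(tree, features):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        feat, yes, no = tree
        tree = yes if feat in features else no
    return tree


def classify_cell(source, tree):
    """Assumed hybrid strategy: an unambiguous rule match wins, else the tree decides."""
    names = call_names(source)
    matched = {RULES[n] for n in names if n in RULES}
    if len(matched) == 1:
        return matched.pop()
    return predict_tree(tree, names)


tree = build_tree(TRAINING)
print(classify_cell("df = pd.read_csv('x.csv')", tree))
print(classify_cell("plt.scatter(x, y)\nplt.show()", tree))
```

In this sketch the first cell is labelled by a rule, while the second contains no call covered by `RULES` and is labelled by the tree, which learned from the toy data that `show` indicates visualization. This illustrates why a learned component can reduce the need to extend a lookup table by hand for every new function, although a realistic classifier would need far more training data and richer features than called names alone.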