Data science skills such as data analysis, data mining, and machine learning are in high demand across many industries. Computational notebooks (e.g., Jupyter Notebook) are widely adopted data science tools in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive a high number of votes from the community. A high-voted notebook is considered to be well documented, easy to understand, and to apply best practices in data science and software engineering. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and of popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse Jupyter Notebooks on both platforms. We will use exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners who want to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
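As an illustration of the kind of structural mining the abstract describes, the following is a minimal sketch of how simple features might be extracted from a notebook file. It is not the authors' pipeline: the function name `notebook_features` and the chosen features (cell counts, markdown ratio, comment and heading counts) are assumptions made for this example. It relies only on the fact that an `.ipynb` file is JSON with a top-level `cells` list whose entries carry `cell_type` and `source`.

```python
import json


def notebook_features(nb_json: str) -> dict:
    """Compute simple structural features from the JSON text of an .ipynb file.

    Hypothetical helper for illustration; the feature set is an assumption.
    """
    nb = json.loads(nb_json)
    cells = nb.get("cells", [])

    def source_text(cell):
        # The notebook format stores "source" as a string or a list of strings.
        src = cell.get("source", "")
        return "".join(src) if isinstance(src, list) else src

    code = [source_text(c) for c in cells if c.get("cell_type") == "code"]
    markdown = [source_text(c) for c in cells if c.get("cell_type") == "markdown"]

    # Count non-empty code lines, comment lines, and markdown heading lines.
    code_lines = sum(1 for s in code for ln in s.splitlines() if ln.strip())
    comment_lines = sum(
        1 for s in code for ln in s.splitlines() if ln.strip().startswith("#")
    )
    headings = sum(
        1 for s in markdown for ln in s.splitlines() if ln.lstrip().startswith("#")
    )

    return {
        "n_code_cells": len(code),
        "n_markdown_cells": len(markdown),
        "markdown_ratio": len(markdown) / len(cells) if cells else 0.0,
        "n_code_lines": code_lines,
        "n_comment_lines": comment_lines,
        "n_headings": headings,
    }


# A tiny in-memory notebook standing in for a mined file.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some intro text."]},
        {"cell_type": "code",
         "source": ["# load data\n", "import csv\n", "\n", "rows = []"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})
features = notebook_features(example)
print(features)
```

Feature dictionaries of this kind, one per notebook, could then be tabulated alongside vote counts as input to the exploratory and feature importance analyses mentioned above.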