R and Python are among the most popular languages for critical data analytics tasks. However, we still do not fully understand how the two languages compare with respect to the bugs encountered in such tasks. What types of bugs are common? What are the main root causes? How are bugs and root causes related? How can these bugs be mitigated? We present a comprehensive study of 5,068 Stack Overflow posts, 1,800 bug-fix commits from GitHub repositories, and several GitHub issues of the most used libraries to understand bugs in R and Python. Our key findings include the following. While both R and Python have bugs due to inexperience with data analysis, Python sees significantly more data preprocessing bugs than R. Developers experience significantly more data flow bugs in R because intermediate results are often implicit. We also found that changes and bugs in packages and libraries cause more bugs in R than in Python, while package or library misselection and conflicts cause more bugs in Python than in R. While R has a slightly higher readability barrier for data analysts, the statistical power of R leads to fewer bad-performance bugs. In terms of data visualization, R packages have significantly more bugs than Python libraries. We also identified a strong correlation between comparable packages in R and Python despite their linguistic and methodological differences. Lastly, we contribute a large dataset of manually verified R and Python bugs.