On the Anatomy of Real-World R Code for Static Analysis

CONTEXT: The R programming language has a huge and active community, especially in the area of statistical computing. Its interpreted nature allows for several interesting constructs, like the manipulation of functions at run-time, that hinder the static analysis of R programs. At the same time, there is little research on how these features, or even the R language as a whole, are used in practice.

OBJECTIVE: In this paper, we conduct a large-scale static analysis of more than 50 million lines of real-world R programs and packages to identify their characteristics and the features that are actually used. Moreover, we compare the scripts of R users with the implementations of package authors, looking at both similarities and differences. We provide insights for static analysis tools like the lintr package, as well as for potential interpreter optimizations, and uncover areas for future research.

METHOD: We analyze 4230 R scripts submitted alongside publications and the sources of 19450 CRAN packages, totaling over 350000 R files, collecting and summarizing quantitative information for features of interest.

RESULTS: We find a high frequency of name-based indexing operations, assignments, and loops, but a low frequency of most of R's reflective functions. Furthermore, we find neither testing functions nor many calls to R's foreign function interface (FFI) in the publication submissions.

CONCLUSION: R scripts and package sources differ, for example, in their size, the way they include other packages, and their usage of R's reflective capabilities. We identify features that are used frequently and that static analysis tools should prioritize, such as operator assignments, function calls, and certain reflective functions like load.
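To make the kind of counting described in the abstract concrete, the following is a minimal sketch of tallying a few of the named feature classes (arrow assignments, name-based `$` indexing, `for` loops, and calls to some reflective functions) in R source text. It is not the authors' tooling: the pattern table, the function names `strip_comments` and `count_features`, and the sample script are all assumptions made for illustration, and regular expressions only approximate what a parser-based static analysis would see.

```python
import re

# Hypothetical feature patterns, chosen for illustration only.
# A real static analysis would inspect a parsed syntax tree instead.
FEATURE_PATTERNS = {
    # `<-` and `<<-` assignments
    "assignment_arrow": re.compile(r"<<?-"),
    # name-based indexing such as `df$value`
    "name_index": re.compile(r"\$[A-Za-z.]"),
    # `for (...)` loops, not preceded by an identifier character
    "for_loop": re.compile(r"(?<![\w.])for\s*\("),
    # calls to a few reflective functions, e.g. `load(...)`
    "reflective_call": re.compile(r"(?<![\w.])(?:eval|parse|get|assign|load)\s*\("),
}


def strip_comments(source: str) -> str:
    """Drop everything after '#' on each line (naive: ignores '#' inside strings)."""
    return "\n".join(line.split("#", 1)[0] for line in source.splitlines())


def count_features(source: str) -> dict:
    """Count regex matches per feature in comment-stripped R source text."""
    code = strip_comments(source)
    return {name: len(pattern.findall(code)) for name, pattern in FEATURE_PATTERNS.items()}


# Made-up R script used only to exercise the counter.
sample = '''
df <- read.csv("data.csv")   # load the data
total <- 0
for (v in df$value) {
  total <- total + v
}
load("model.RData")
'''

if __name__ == "__main__":
    for feature, count in count_features(sample).items():
        print(feature, count)
```

A token-level count like this miscounts whenever a pattern appears inside a string literal or a function is called indirectly, which is one reason an analysis at this scale would rely on parsing rather than text matching.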
