Learning Type Inference for Enhanced Dataflow Analysis

Statically analyzing dynamically-typed code is challenging: even seemingly simple tasks, such as determining the targets of procedure calls, are non-trivial when object types are unknown at compile time. To address this challenge, gradual typing is increasingly being added to dynamically-typed languages, a prominent example being TypeScript, which introduces static typing to JavaScript. Gradual typing improves the developer's ability to verify program behavior, contributing to robust, secure, and debuggable programs. In practice, however, users annotate types only sparsely. At the same time, conventional type inference faces performance-related challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate improved overall accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Limiting their real-world usefulness even further, they rarely integrate with user-facing applications. We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program's code property graph. Compared against recent neural type inference systems, our model outperforms the current state of the art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into Joern, an open-source static analysis tool, and demonstrate that the analysis benefits from the additional type information. Because our model allows for fast inference even on commodity CPUs, making our system available through Joern makes it highly accessible and facilitates security research.
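To make the call-target problem from the opening sentence concrete, the following is a minimal TypeScript sketch. It is not taken from CodeTIDAL5 or JoernTI; the classes `FileHandle` and `DbConnection`, the `methodTable`, and the helper `candidateTargets` are hypothetical names introduced purely for illustration. The sketch shows how a name-based analysis must consider every type that defines a method when the receiver type is unknown, and how a single type annotation (whether written by hand or predicted) narrows the candidates to one.

```typescript
// Two unrelated, hypothetical classes that happen to share a method name.
class FileHandle {
  close(): string {
    return "FileHandle.close";
  }
}

class DbConnection {
  close(): string {
    return "DbConnection.close";
  }
}

// Without a useful annotation, the receiver is `any`: statically, the call
// below could dispatch to either `close` definition.
function shutdownUntyped(resource: any): string {
  return resource.close();
}

// Toy stand-in for what an analyzer knows: which types define which methods.
const methodTable: Record<string, string[]> = {
  FileHandle: ["close"],
  DbConnection: ["close"],
};

// Hypothetical helper: lists the call targets a name-based analysis has to
// consider. If `receiverType` is given, only that type is considered.
function candidateTargets(method: string, receiverType?: string): string[] {
  return Object.keys(methodTable)
    .filter(
      (typeName) =>
        (receiverType === undefined || typeName === receiverType) &&
        methodTable[typeName].indexOf(method) !== -1
    )
    .map((typeName) => `${typeName}.${method}`);
}

// Unknown receiver type: both definitions remain possible targets.
const ambiguous = candidateTargets("close");

// Known receiver type: exactly one target remains.
const resolved = candidateTargets("close", "DbConnection");
```

In this toy setting, the ambiguous query yields two candidates and the typed query yields one. A real analysis operates on a far richer program representation, but the narrowing effect of type information is the same in kind.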
