Learning Type Inference for Enhanced Dataflow Analysis

Statically analyzing dynamically-typed code is challenging: even seemingly trivial tasks, such as determining the targets of procedure calls, become difficult when object types are unknown at compile time. To address this challenge, gradual typing is increasingly being added to dynamically-typed languages; a prominent example is TypeScript, which introduces static typing to JavaScript. Gradual typing improves the developer's ability to verify program behavior, contributing to robust, secure, and debuggable programs. In practice, however, users annotate types only sparsely. At the same time, conventional type inference faces performance-related challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate improved overall accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Further limiting their real-world usefulness, they rarely integrate with user-facing applications. We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program's code property graph. Compared against recent neural type inference systems, our model outperforms the current state of the art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into Joern, an open-source static analysis tool, and demonstrate that the analysis benefits from the additional type information. Because our model allows for fast inference even on commodity CPUs, making our system available through Joern makes it highly accessible and facilitates security research.
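To make the call-target problem concrete, the following is a minimal sketch in TypeScript. It is an illustration under stated assumptions, not part of CodeTIDAL5 or JoernTI: the classes `HttpClient` and `Logger`, the functions `forwardUntyped` and `forwardTyped`, and the toy resolver `candidateTargets` with its `methodTable` are all hypothetical names introduced here. The sketch shows how a receiver type, whether written by hand or predicted, narrows the set of methods a call may resolve to.

```typescript
// Hypothetical classes for illustration; both expose a method named `send`.
class HttpClient {
  constructor(readonly endpoint: string) {}
  send(payload: string): string {
    return `http:${payload}`;
  }
}

class Logger {
  constructor(readonly prefix: string) {}
  send(payload: string): string {
    return `log:${payload}`;
  }
}

// Without a useful annotation, `sink` is `any`: a static analyzer only
// sees a call to *some* `send` and has to consider every candidate.
function forwardUntyped(sink: any, data: string): string {
  return sink.send(data);
}

// With an annotation, the receiver type is known and the candidate set
// for `sink.send` shrinks accordingly.
function forwardTyped(sink: HttpClient, data: string): string {
  return sink.send(data);
}

// Toy call-target resolver (an assumption of this sketch, not Joern's
// algorithm): maps a method name to all known definitions and filters
// by receiver type when one is available.
const methodTable: Record<string, string[]> = {
  send: ["HttpClient.send", "Logger.send"],
};

function candidateTargets(method: string, receiverType?: string): string[] {
  const all = methodTable[method] || [];
  if (receiverType === undefined) {
    return all; // no type information: every definition is a candidate
  }
  return all.filter((target) => target.split(".")[0] === receiverType);
}
```

In this sketch, `candidateTargets("send")` returns both definitions, while `candidateTargets("send", "HttpClient")` returns only `HttpClient.send`. A real analysis would also have to account for subclasses and structurally compatible types, which this toy resolver ignores.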
