Type Prediction With Program Decomposition and Fill-in-the-Type Training

TypeScript and Python are two programming languages that support optional type annotations, which are useful but tedious to introduce and maintain. This has motivated automated type prediction: given an untyped program, produce a well-typed output program. Large language models (LLMs) are promising for type prediction, but there are challenges: fill-in-the-middle performs poorly, programs may not fit into the context window, generated types may not type check, and it is difficult to measure how well-typed the output program is. We address these challenges by building OpenTau, a search-based approach for type prediction that leverages large language models. We propose a new metric for type prediction quality, give a tree-based program decomposition that searches a space of generated types, and present fill-in-the-type fine-tuning for LLMs. We evaluate our work with a new dataset for TypeScript type prediction, and show that 47.4% of files type check (14.5% absolute improvement) with an overall rate of 3.3 type errors per file. All code, data, and models are available at: https://github.com/GammaTauAI/opentau.
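To make the two reported figures concrete, here is a minimal sketch of how a file-level type-check rate and a mean error count could be computed from per-file type-checker results. The `FileResult` shape, the `summarize` helper, and the sample data are hypothetical names and values introduced for illustration; they are not taken from OpenTau, and the metric proposed in the paper may be defined differently.

```typescript
// Hypothetical per-file result: how many errors a type checker reported.
interface FileResult {
  file: string;
  typeErrors: number;
}

interface Summary {
  typeCheckRate: number; // fraction of files with zero type errors
  errorsPerFile: number; // mean number of type errors per file
}

// Aggregate per-file error counts into the two summary statistics.
function summarize(results: FileResult[]): Summary {
  if (results.length === 0) {
    // Assumption: an empty dataset yields zeros rather than NaN.
    return { typeCheckRate: 0, errorsPerFile: 0 };
  }
  const passing = results.filter((r) => r.typeErrors === 0).length;
  const totalErrors = results.reduce((sum, r) => sum + r.typeErrors, 0);
  return {
    typeCheckRate: passing / results.length,
    errorsPerFile: totalErrors / results.length,
  };
}

// Made-up sample data, purely for illustration.
const example: FileResult[] = [
  { file: "a.ts", typeErrors: 0 },
  { file: "b.ts", typeErrors: 4 },
  { file: "c.ts", typeErrors: 0 },
  { file: "d.ts", typeErrors: 2 },
];
const summary: Summary = summarize(example);
```

With this sample, two of four files pass, so `summary.typeCheckRate` is 0.5 and `summary.errorsPerFile` is 1.5. Reporting both numbers matters because a file that fails with one error is much closer to well-typed than a file that fails with twenty, which a pass/fail rate alone cannot show.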

