Guiding Language Models of Code with Global Context using Monitors

Language models of code (LMs) work well when the code surrounding the point of generation provides sufficient context. This is not the case when the model must use types or functionality defined in another module or library, especially ones not seen during training. LMs have limited awareness of such global context and end up hallucinating, e.g., using types defined in other files incorrectly. Recent work tries to overcome this issue by retrieving global information to augment the local context. However, this bloats the prompt or requires architecture modifications and additional training. Integrated development environments (IDEs) assist developers by bringing the global context to their fingertips using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose a notion of monitors that use static analysis in the background to guide the decoding. Unlike a priori retrieval, static analysis is invoked iteratively throughout the decoding process, providing the most relevant suggestions on demand. We demonstrate the usefulness of our proposal by monitoring for type-consistent use of identifiers whenever an LM generates code for object dereference. To evaluate our approach, we curate PragmaticCode, a dataset of open-source projects with their development environments. On models of varying parameter scale, we show that monitor-guided decoding consistently improves the ability of an LM to generate identifiers that match the ground truth, and also improves compilation rates and agreement with the ground truth. We find that LMs with fewer parameters, when guided with our monitor, can outperform larger LMs. With monitor-guided decoding, SantaCoder-1.1B achieves a better compilation rate and next-identifier match than the much larger text-davinci-003 model. The datasets and code will be released at https://aka.ms/monitors4codegen.
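The idea of a monitor that constrains decoding at object dereferences can be illustrated with a minimal sketch. This is not the released implementation: the names `TYPE_MEMBERS`, `monitor`, `mask_logits`, `guided_greedy_decode`, and `toy_lm` are hypothetical, the lookup table stands in for a real static analysis, and each token is assumed to be a whole identifier, whereas a real LM emits subword tokens.

```python
import math
from typing import Callable, Dict, List, Optional, Set

# Hypothetical stand-in for static analysis: maps a variable name to the
# members that are valid on its type. A real monitor would ask an IDE-style
# analysis of the whole repository instead of a hard-coded table.
TYPE_MEMBERS: Dict[str, Set[str]] = {
    "request": {"headers", "body", "method"},
}


def monitor(tokens: List[str]) -> Optional[Set[str]]:
    """Return the allowed next tokens at a dereference, or None elsewhere."""
    if len(tokens) >= 2 and tokens[-1] == "." and tokens[-2] in TYPE_MEMBERS:
        return TYPE_MEMBERS[tokens[-2]]
    return None  # monitor is not triggered; leave the LM unconstrained


def mask_logits(
    logits: Dict[str, float], allowed: Optional[Set[str]]
) -> Dict[str, float]:
    """Set the score of every token outside `allowed` to -inf."""
    if allowed is None:
        return dict(logits)
    return {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }


def guided_greedy_decode(
    prefix: List[str],
    next_logits: Callable[[List[str]], Dict[str, float]],
    max_steps: int,
    stop: str = "<eos>",
) -> List[str]:
    """Greedy decoding in which the monitor is consulted at every step."""
    tokens = list(prefix)
    for _ in range(max_steps):
        logits = mask_logits(next_logits(tokens), monitor(tokens))
        best = max(logits, key=logits.get)
        if best == stop:
            break
        tokens.append(best)
    return tokens


def toy_lm(tokens: List[str]) -> Dict[str, float]:
    """Toy model that prefers a hallucinated member `payload` after a dot."""
    if tokens and tokens[-1] == ".":
        return {"payload": 2.0, "body": 1.5, "headers": 0.5, "<eos>": -1.0}
    return {"<eos>": 1.0, "payload": 0.0, "body": 0.0, "headers": 0.0}
```

With the toy model, unguided greedy decoding after `request.` would pick the nonexistent member `payload`, while `guided_greedy_decode(["request", "."], toy_lm, 5)` selects `body`, the highest-scoring member that the lookup table allows. The monitor is queried on every step rather than once before generation, which mirrors the contrast with a priori retrieval drawn above.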
