Language models of code (LMs) work well when the surrounding code provides sufficient context. They fare worse when the code must use types, functionality, or APIs defined elsewhere in the repository or in a linked library, especially ones not seen during training. LMs have limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) help developers understand repository context using static analysis. We extend this assistance to LMs. We propose monitor-guided decoding (MGD), in which a monitor uses static analysis to guide the decoding. We construct a repository-level dataset, PragmaticCode, for method completion in Java and evaluate MGD on it. Across models of varying parameter scale, MGD consistently improves compilation rates and agreement with ground truth by monitoring for type-consistent object dereferences. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves a better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate how well MGD generalizes to multiple programming languages (Java, C# and Rust), to other coding scenarios (e.g., correct number of arguments to method calls), and to richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.
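The idea of a monitor guiding decoding at an object dereference can be illustrated with a toy sketch. This is not the implementation from the repository above: the member table `TYPE_MEMBERS`, the type `ServerNode`, the functions `monitor_mask`, `toy_logits` and `complete_member`, and the fixed scores are all hypothetical stand-ins, and the static analysis is replaced by a hard-coded lookup. The sketch only shows the general shape: at each step, tokens that cannot extend the partial identifier into a member of the receiver's type are masked out before the highest-scoring token is chosen.

```python
# Hypothetical stand-in for static analysis: receiver type -> valid member names.
TYPE_MEMBERS = {"ServerNode": ["withIp", "withPort", "build"]}

# Toy sub-word vocabulary; a real tokenizer would be far larger.
VOCAB = ["get", "with", "build", "Host", "Port", "Ip", "Name"]


def toy_logits(partial):
    """Hypothetical model scores, one per VOCAB entry.

    The scores ignore `partial` and favour the hallucinated name 'getHost'.
    """
    return [3.0, 2.0, 1.0, 2.5, 1.5, 1.0, 0.5]


def monitor_mask(vocab, partial, members):
    """Return ids of tokens that keep `partial` a prefix of some valid member."""
    allowed = set()
    for token_id, token in enumerate(vocab):
        candidate = partial + token
        if any(member.startswith(candidate) for member in members):
            allowed.add(token_id)
    return allowed


def complete_member(vocab, score_fn, receiver_type):
    """Greedily decode one member name, restricted by the monitor's mask.

    Simplification: decoding stops as soon as the partial identifier equals
    a member name, and the vocabulary is assumed able to spell every member.
    """
    members = TYPE_MEMBERS[receiver_type]
    partial = ""
    while partial not in members:
        logits = score_fn(partial)
        allowed = monitor_mask(vocab, partial, members)
        # Pick the best-scoring token among those the monitor permits.
        token_id = max(allowed, key=lambda i: logits[i])
        partial += vocab[token_id]
    return partial


def unconstrained_first_token(vocab, score_fn):
    """Greedy choice without the monitor, for comparison."""
    logits = score_fn("")
    return vocab[max(range(len(vocab)), key=lambda i: logits[i])]


if __name__ == "__main__":
    print(unconstrained_first_token(VOCAB, toy_logits))
    print(complete_member(VOCAB, toy_logits, "ServerNode"))
```

In this toy setting the unconstrained model starts with `get`, which no member of the hypothetical `ServerNode` type begins with, while the masked decoder is steered to a name that exists on the type. The model's scores are otherwise left untouched, so its preferences still decide among the valid options.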