The current state-of-the-art large language models (LLMs) are effective at generating high-quality text and encapsulating a broad spectrum of world knowledge. However, these models often hallucinate during generation and are not designed to use external information sources. Retrieval-augmented LLMs were introduced to enable queries to external knowledge bases, a capability also called knowledge grounding. So far, their applications have largely involved tasks such as Open Domain Question Answering and Abstractive Question Answering. In this paper, we broaden the scope of retrieval-augmented LLMs by venturing into a new task: code generation using external entities. For this task, we collect and publish a new dataset for project-level code generation, where the model should reuse functions defined in the project during generation. As we show, existing retrieval-augmented LLMs fail to assign relevance scores that distinguish between similar entity names; to mitigate this, they expand entity names with description context and append it to the input. In practice, however, their limited context size cannot accommodate the arbitrarily large context of a whole project. To solve this issue, we propose a novel end-to-end trainable architecture with a scalable entity retriever injected directly into the LLM decoder. We demonstrate that our model can outperform common baselines in several scenarios, including project-level code generation, as well as Bash and SQL scripting.