Current state-of-the-art large language models are effective at generating high-quality text and encapsulate a broad spectrum of world knowledge. These models, however, often hallucinate and lack locally relevant factual data. Retrieval-augmented approaches were introduced to overcome these problems and provide more accurate responses. Typically, the retrieved information is simply appended to the main request, which consumes the model's limited context window. We propose a novel approach, Dynamic Retrieval-Augmented Generation (DRAG), based on entity-augmented generation, which injects compressed embeddings of the retrieved entities into the generative model. The proposed pipeline was developed for code-generation tasks, but it can be transferred to some domains of natural language processing. To train the model, we collect and publish a new project-level code generation dataset, which we also use for evaluation alongside publicly available datasets. Our approach achieves several goals: (1) lifting the length limitations of the context window and reducing the prompt size; (2) greatly expanding the number of retrieved entities available in the context; (3) reducing cases where the model misspells entity names or fails to find the relevant ones. As a result, the model outperforms all baselines except GPT-3.5 by a wide margin.
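To make the contrast between appending retrieved text and injecting compressed entity embeddings concrete, the following is a minimal toy sketch, not the DRAG implementation. Everything in it is an assumption made for illustration: the hash-based `embed_token`, the mean-pooling `compress_entity`, the embedding width `DIM`, and the example entities are hypothetical, and the abstract does not specify how the actual compressor or injection mechanism works. The sketch only shows why one vector per entity occupies fewer context positions than the entity's full text.

```python
import hashlib

DIM = 8  # toy embedding width (assumption, not from the paper)


def embed_token(token: str) -> list[float]:
    """Deterministic toy embedding: the first DIM hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def compress_entity(tokens: list[str]) -> list[float]:
    """Mean-pool an entity's token embeddings into a single vector.

    Mean pooling is a stand-in chosen for this sketch; the real compressor
    is not described in the abstract.
    """
    if not tokens:
        raise ValueError("an entity needs at least one token")
    vectors = [embed_token(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def append_as_text(query: list[str], entities: dict[str, list[str]]) -> list[list[float]]:
    """Baseline: every token of every retrieved entity takes one context position."""
    tokens = list(query)
    for entity_tokens in entities.values():
        tokens.extend(entity_tokens)
    return [embed_token(t) for t in tokens]


def inject_compressed(query: list[str], entities: dict[str, list[str]]) -> list[list[float]]:
    """Sketch of injection: each retrieved entity takes exactly one context position."""
    sequence = [embed_token(t) for t in query]
    sequence.extend(compress_entity(toks) for toks in entities.values())
    return sequence


# Hypothetical query and retrieved project-level entities, pre-tokenized by whitespace.
query = "call the helper that parses config".split()
entities = {
    "parse_config": "def parse_config ( path : str ) -> dict".split(),
    "load_yaml": "def load_yaml ( stream ) -> dict".split(),
}

appended = append_as_text(query, entities)
injected = inject_compressed(query, entities)
print(len(appended), len(injected))
```

In this toy setup the appended sequence grows with the total length of the retrieved text, while the injected sequence grows only with the number of entities, which is the property behind goals (1) and (2) above.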