Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by retrieving supporting documents into the prompt, but existing methods do not explicitly target queries that require fetching multiple documents with substantially different content. Such multi-aspect queries are challenging because relevant documents can be far apart in embedding space, making joint retrieval difficult. We introduce Multi-Head RAG (MRAG), which addresses this gap with a simple yet powerful idea: using the activations of the Transformer's multi-head attention layer, rather than the standard decoder-layer embedding, as keys for retrieval. This design leverages the observation that different attention heads capture different semantic aspects, yielding multi-aspect embeddings for both documents and queries and improving retrieval accuracy on complex queries. We demonstrate MRAG's design advantages over 18 RAG baselines, with up to 20% higher retrieval success ratios on real-world use cases, and improved downstream LLM generation. MRAG integrates seamlessly with existing RAG frameworks and benchmarks.
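The retrieval scheme described above can be sketched in a few lines. The sketch below is illustrative only: `embed_multi_head` is a hypothetical stand-in for extracting per-head attention activations from a real decoder model, and the rank-based voting used to merge the per-head rankings is one plausible aggregation choice, not necessarily the paper's exact method.

```python
import numpy as np

NUM_HEADS, HEAD_DIM = 4, 8  # toy sizes; real models use e.g. 32 heads x 128 dims


def embed_multi_head(text: str) -> np.ndarray:
    """Hypothetical stand-in for a model forward pass.

    Returns a (NUM_HEADS, HEAD_DIM) matrix, one embedding per attention
    head; a real pipeline would take these from the last attention
    layer's per-head activations for the final token.
    """
    rng = np.random.default_rng([ord(c) for c in text])  # deterministic toy embedding
    return rng.normal(size=(NUM_HEADS, HEAD_DIM))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Multi-aspect retrieval: rank documents per head, then merge by voting."""
    q = embed_multi_head(query)
    d_embs = [embed_multi_head(d) for d in docs]
    votes = np.zeros(len(docs))
    for h in range(NUM_HEADS):
        # Each head produces its own ranking over the corpus ...
        order = np.argsort([-cosine(q[h], e[h]) for e in d_embs])
        # ... and awards descending scores to its top documents.
        for rank, idx in enumerate(order):
            votes[idx] += len(docs) - rank
    top = np.argsort(-votes)[:k]
    return [docs[i] for i in top]
```

Because each head contributes its own ranking, documents that are distant from one another in the full embedding space can still each win votes from different heads, which is what makes joint retrieval of multi-aspect answers feasible.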