Retrieval Augmented Generation (RAG) has become one of the most popular methods for bringing knowledge-intensive context to large language models (LLMs) because it can supply local context at inference time without the cost or data-leakage risks associated with fine-tuning. This clear separation of private information from LLM training has made RAG the basis for many enterprise LLM workloads, as it allows companies to augment an LLM's understanding with customers' private documents. Despite its popularity for private documents in enterprise deployments, current benchmarks for validating and optimizing RAG pipelines draw their corpora from public data such as Wikipedia or generic web pages and offer little to no personal context. Seeking to empower more personal and private RAG, we release the EnronQA benchmark, a dataset of 103,638 emails with 528,304 question-answer pairs across 150 different user inboxes. EnronQA enables better benchmarking of RAG pipelines over private data and allows for experimentation with personalized retrieval settings over realistic data. Finally, we use EnronQA to explore the tradeoff between memorization and retrieval when reasoning over private documents.