Retrieval-Augmented Generation (RAG) systems are a widespread industrial application of Large Language Models (LLMs). While many tools empower developers to build their own systems, measuring their performance locally, with datasets that reflect the system's use cases, remains a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question-and-answer (Q&A) datasets to assess retrieval performance can lead to suboptimal system design, and that common tools for RAG dataset generation can produce unbalanced data. We propose solutions to these issues based on characterizing RAG datasets through labels and on label-targeted data generation. Finally, we show that fine-tuned small LLMs can generate Q&A datasets efficiently. We believe these observations are invaluable to the know-your-data step of RAG systems development.