Retrieval-Augmented Generation (RAG) systems are a widespread industrial application of Large Language Models (LLMs). While many tools empower developers to build their own systems, measuring their performance locally, with datasets that reflect the system's use cases, remains a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question-and-answer (Q&A) datasets to assess retrieval performance can lead to suboptimal system design, and that common tools for RAG dataset generation can produce unbalanced data. We propose solutions to these issues based on characterizing RAG datasets through labels and on label-targeted data generation. Finally, we show that fine-tuned small LLMs can generate Q&A datasets efficiently. We believe these observations are invaluable to the know-your-data step of RAG systems development.