EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Source: arXiv. Code and dataset are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench and https://huggingface.co/datasets/onyx-dot-app/EnterpriseRAG-Bench

Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.
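
Because the question set is split into categories that test different retrieval capabilities, results are most informative when reported per category rather than as one aggregate number. The following is a minimal sketch of that idea using recall@k. It is not the benchmark's evaluation harness: the metric choice, the field names (`id`, `category`, `gold_doc_ids`), the category labels, and the document IDs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold documents that appear in the top-k retrieved list.

    Returns None when there are no gold documents (for example, a question
    whose answer is absent from the corpus), since recall is undefined there.
    """
    gold = set(gold_ids)
    if not gold:
        return None
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold) / len(gold)


def recall_by_category(questions, runs, k=5):
    """Average recall@k per question category.

    `questions` is a list of dicts with hypothetical keys `id`, `category`,
    and `gold_doc_ids`; `runs` maps a question id to a ranked list of
    retrieved document ids.
    """
    per_category = defaultdict(list)
    for q in questions:
        score = recall_at_k(runs.get(q["id"], []), q["gold_doc_ids"], k)
        if score is not None:  # skip questions with no gold documents
            per_category[q["category"]].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


# Toy data; all identifiers below are made up for this example.
questions = [
    {"id": "q1", "category": "single_doc", "gold_doc_ids": ["slack-1"]},
    {"id": "q2", "category": "multi_doc", "gold_doc_ids": ["jira-7", "gmail-3"]},
    {"id": "q3", "category": "absent", "gold_doc_ids": []},
]
runs = {
    "q1": ["slack-1", "gmail-9"],
    "q2": ["jira-7", "slack-2"],
    "q3": ["gmail-1"],
}

scores = recall_by_category(questions, runs, k=2)
print(scores)
```

Questions without gold documents are skipped here because recall is undefined for them; a real harness would need a separate check for whether the system correctly reports that the information is absent.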

