We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench uses a full factorial design and a novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, and conclude that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench and report the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
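As a rough illustration of the full factorial design described above, the sketch below enumerates every pairing of the three embedding models with the two LLMs, giving six configurations. The model names come from the abstract; the `factorial_grid` helper and the dictionary layout are hypothetical and are not taken from the released code.

```python
from itertools import product

# Factor levels named in the abstract.
EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def factorial_grid(embedders, llms):
    """Return every embedder/LLM pairing (hypothetical helper, for illustration only)."""
    return [{"embedder": e, "llm": m} for e, m in product(embedders, llms)]


configs = factorial_grid(EMBEDDERS, LLMS)

# A 3 x 2 design yields six configurations to evaluate.
for config in configs:
    print(f"{config['embedder']} + {config['llm']}")
```

Because every embedding model is crossed with every LLM, a difference between two configurations that share one component can be attributed to the component that changed, which is presumably what makes the comparisons apples-to-apples.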