Retrieval-Augmented Generation (RAG) is a promising approach to mitigating hallucinations in Large Language Models (LLMs) for legal applications, but its reliability depends critically on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode that we term Document-Level Retrieval Mismatch (DRM), in which the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique that we call Summary-Augmented Chunking (SAC). This method augments each text chunk with a synthetic document-level summary, thereby injecting global context that would otherwise be lost during standard chunking. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms one that draws on legal-expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems applied to large-scale legal document datasets.
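To make the two central notions concrete, the following is a minimal Python sketch of how SAC and a DRM measurement might look. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the `Chunk` type, the fixed-size character splitter, the way the summary is prepended to each chunk, the `first_sentence` stand-in summarizer, and the exact DRM formula (share of retrieved chunks whose source document differs from the gold document) are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_id: str
    text: str        # original chunk text
    index_text: str  # summary + chunk text, i.e. what would be embedded and indexed


def split_fixed(text: str, size: int) -> List[str]:
    """Naive fixed-size character splitter (an assumption, not the paper's splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summary_augmented_chunks(
    doc_id: str,
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 500,
) -> List[Chunk]:
    """Attach one document-level summary to every chunk of the document."""
    summary = summarize(text)  # computed once per document, reused for all chunks
    return [
        Chunk(doc_id=doc_id, text=piece, index_text=f"{summary}\n\n{piece}")
        for piece in split_fixed(text, chunk_size)
    ]


def drm_rate(retrieved_doc_ids: List[str], gold_doc_id: str) -> float:
    """Assumed DRM measure: fraction of retrieved chunks from the wrong document."""
    if not retrieved_doc_ids:
        return 0.0
    wrong = sum(1 for d in retrieved_doc_ids if d != gold_doc_id)
    return wrong / len(retrieved_doc_ids)


def first_sentence(text: str) -> str:
    """Stand-in summarizer; a real system would call an LLM here."""
    return text.split(".")[0].strip() + "."
```

In this sketch the summary is generated once per document and only affects the indexed text, so the added cost is one summarization call per document rather than per chunk, which is consistent with the abstract's description of the technique as computationally efficient.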