This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist"). The authors introduce two reliability metrics, the False Citation Rate (FCR) and the Fabricated Fact Rate (FFR), and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.
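The two headline metrics can be operationalized in a few lines. The sketch below assumes a plausible micro-averaged definition over the expert annotations: FCR as the share of all cited authorities judged false (nonexistent or misattributed), and FFR as the share of all factual assertions judged fabricated. The `Answer` schema and field names are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    """Expert annotations for one model answer (hypothetical schema)."""
    citations_total: int      # authorities cited in the answer
    citations_false: int      # citations judged nonexistent or misattributed
    facts_total: int          # discrete factual assertions in the answer
    facts_fabricated: int     # assertions with no support in the record

def false_citation_rate(answers):
    """FCR: share of all cited authorities judged false (micro-averaged)."""
    total = sum(a.citations_total for a in answers)
    false = sum(a.citations_false for a in answers)
    return false / total if total else 0.0

def fabricated_fact_rate(answers):
    """FFR: share of all factual assertions judged fabricated."""
    total = sum(a.facts_total for a in answers)
    fab = sum(a.facts_fabricated for a in answers)
    return fab / total if total else 0.0
```

Micro-averaging weights answers by how many citations or assertions they contain; the paper may instead average per-answer rates, so treat this only as one reasonable reading of the metric names.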