LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
