Large language models (LLMs) have the potential to transform the practice of law, but this potential is threatened by the presence of legal hallucinations -- responses from these models that are not consistent with legal facts. We investigate the extent of these hallucinations using an original suite of legal queries, comparing LLMs' responses to structured legal metadata and examining their consistency. Our work makes four key contributions: (1) We develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. (2) We find that legal hallucinations are alarmingly prevalent when these models are asked specific, verifiable questions about random federal court cases, with rates ranging from 69% for ChatGPT 3.5 to 88% for Llama 2. (3) We illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. (4) We provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, these findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.