Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strictly erroneous sense, to refer to the differences between human-written and machine-generated legal analysis. Gaps do not always indicate invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves a 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% of the outputs contain hallucinations of different kinds.