Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems that assist legal professionals in writing such documents offer great benefits but are challenging to design. To be useful, such systems need to help locate, summarize, and reason over salient precedents. To enable systems for these tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability (1) to find the corresponding citations for a given piece of legal analysis and (2) to compile the text of these citations (as well as the preceding context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC and show that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models achieve only 48.3% recall@1000.
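To make the recall@1000 figure concrete, the following is a minimal sketch of how recall@k is commonly computed for a retrieval benchmark. It assumes the standard definition (the fraction of a query's relevant documents that appear in the top k retrieved results, averaged over queries); the exact evaluation protocol used for CLERC may differ, and the function names and data layout here are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents found among the top-k ranked results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],  # query id -> ranked document ids (assumed layout)
    qrels: Dict[str, Set[str]],      # query id -> ids of the cited (relevant) cases
    k: int,
) -> float:
    """Average recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(rankings.get(q, []), rel, k) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy example with made-up ids: two queries, three retrieved documents each.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
qrels = {"q1": {"a", "d"}, "q2": {"z"}}
print(mean_recall_at_k(rankings, qrels, k=2))  # 0.25
print(mean_recall_at_k(rankings, qrels, k=3))  # 0.75
```

Under this definition, a recall@1000 of 48.3% would mean that, on average, fewer than half of the cited cases appear anywhere in the top 1000 retrieved documents.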