PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei

As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
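To make the rubric-based setup concrete: with about 12,500 rubric items over 850 questions, each question carries roughly 15 items on average. The sketch below shows one way per-item verdicts from a judge could be aggregated into a question-level score. It is an illustration under stated assumptions, not the benchmark's actual scoring procedure: the `RubricItem` class, the per-item `weight` field, and the weighted-fraction aggregation are all hypothetical, and the abstract does not specify how items are weighted or combined.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RubricItem:
    """One expert-written criterion (hypothetical structure)."""
    description: str
    weight: float = 1.0  # assumption: the abstract does not describe item weights


def score_response(items: Sequence[RubricItem], verdicts: Sequence[bool]) -> float:
    """Return the weighted fraction of rubric items judged as satisfied.

    `verdicts[i]` is the judge's yes/no decision for `items[i]`. How those
    decisions are produced (for example by an LLM-based evaluator) is
    outside this sketch.
    """
    if len(items) != len(verdicts):
        raise ValueError("each rubric item needs exactly one verdict")
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item, ok in zip(items, verdicts) if ok)
    return earned / total


# Illustrative rubric for a single consultation question (made-up items).
example_items = [
    RubricItem("Identifies the governing legal issue", weight=2.0),
    RubricItem("Extracts the key facts", weight=1.0),
    RubricItem("States the applicable limitation period", weight=1.0),
]
example_score = score_response(example_items, [True, False, True])  # 0.75
```

Scoring at the item level is what allows a fine-grained view: two responses with the same overall score can fail on different criteria, which a single coarse metric would hide.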
