As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.