LawBench: Benchmarking Legal Knowledge of Large Language Models

Large language models (LLMs) have demonstrated strong capabilities in many areas. However, when they are applied to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark. LawBench is designed to assess LLMs' legal capabilities precisely at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize the required legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge applying: whether LLMs can properly use their legal knowledge and carry out the reasoning steps needed to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings some improvement, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
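
To make the benchmark's organization concrete, the following is a minimal Python sketch of the taxonomy described above: three cognitive levels, five task types, and per-level averaging of task scores. It is an illustration under stated assumptions, not the released evaluation code. The names `Level`, `TaskType`, `Task` and `average_by_level`, the task identifiers, and the scores are all hypothetical, and averaging within a level is an assumed aggregation rather than one stated in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Level(Enum):
    """The three cognitive levels named in the abstract."""
    MEMORIZATION = "memorization"
    UNDERSTANDING = "understanding"
    APPLYING = "applying"


class TaskType(Enum):
    """The five task types named in the abstract."""
    SLC = "single-label classification"
    MLC = "multi-label classification"
    REGRESSION = "regression"
    EXTRACTION = "extraction"
    GENERATION = "generation"


@dataclass(frozen=True)
class Task:
    name: str  # hypothetical identifier, not a real LawBench task ID
    level: Level
    task_type: TaskType


def average_by_level(scores: dict[Task, float]) -> dict[Level, float]:
    """Average per-task scores within each cognitive level (assumed aggregation)."""
    grouped = defaultdict(list)
    for task, score in scores.items():
        grouped[task.level].append(score)
    return {level: mean(values) for level, values in grouped.items()}


# Placeholder scores for illustration only; these are not benchmark results.
demo_scores = {
    Task("demo_recall", Level.MEMORIZATION, TaskType.GENERATION): 0.50,
    Task("demo_tagging", Level.UNDERSTANDING, TaskType.EXTRACTION): 0.25,
    Task("demo_labeling", Level.UNDERSTANDING, TaskType.MLC): 0.75,
    Task("demo_judgment", Level.APPLYING, TaskType.REGRESSION): 1.00,
}

per_level = average_by_level(demo_scores)
```

Tagging each task with both a level and a task type keeps the two axes independent, so the same scores could be regrouped by task type without changing the task definitions.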
