Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain. However, legal applications demand high standards of accuracy, reliability, and fairness. Applying existing LLMs to legal systems without carefully evaluating their potential and limitations could pose significant risks in legal practice. To this end, we introduce LexEval, a standardized, comprehensive Chinese legal benchmark. This benchmark is notable in three aspects: (1) Ability Modeling: We propose a new taxonomy of legal cognitive abilities to organize different tasks. (2) Scale: To our knowledge, LexEval is currently the largest Chinese legal evaluation dataset, comprising 23 tasks and 14,150 questions. (3) Data: We combine reformatted existing datasets, exam datasets, and new datasets annotated by legal experts to comprehensively evaluate the capabilities of LLMs. LexEval focuses not only on the ability of LLMs to apply fundamental legal knowledge but also on the ethical issues involved in their application. We evaluated 38 open-source and commercial LLMs and obtained several interesting findings. The experiments and findings offer valuable insights into the challenges and potential solutions for developing Chinese legal systems and LLM evaluation pipelines. The LexEval dataset and leaderboard are publicly available at \url{https://github.com/CSHaitao/LexEval} and will be continuously updated.