Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However, their efficacy remains limited for non-standardized tasks and for tasks in languages other than English. This underscores the need for careful evaluation of LLMs within each legal system before application. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). The first two datasets were developed in close collaboration with lawyers to evaluate LLMs on practical scenarios in a certified manner. Furthermore, considering legal practitioners' frequent use of extensive legal documents for research, we assess LLMs in both a closed-book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting, using a corpus of Korean statutes and precedents. The results indicate substantial room for improvement.