The rapid advancement of Large Language Models (LLMs) has led to significant improvements in various natural language processing tasks. However, the evaluation of LLMs' legal knowledge, particularly in non-English languages such as Arabic, remains under-explored. To address this gap, we introduce ArabLegalEval, a multitask benchmark dataset for assessing the Arabic legal knowledge of LLMs. Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions. In this work, we aim to analyze the capabilities required to solve legal problems in Arabic and to benchmark the performance of state-of-the-art LLMs. We explore the impact of in-context learning and investigate various evaluation methods. Additionally, we explore workflows for generating questions with automatic validation to enhance the dataset's quality. We benchmark both multilingual LLMs, such as GPT-4, and Arabic-centric LLMs, such as Jais. We also share our methodology for dataset creation and validation, which can be generalized to other domains. We hope to accelerate AI research in the Arabic legal domain by releasing the ArabLegalEval dataset and code: https://github.com/Thiqah/ArabLegalEval