Propelled by recent advances around the transformer architecture, the legal NLP field has grown rapidly. To measure progress, well-curated and challenging benchmarks are crucial. However, most benchmarks are English-only, and in legal NLP specifically no multilingual benchmark is available yet. Additionally, many benchmarks are saturated, with the best models clearly outperforming the best humans and achieving near-perfect scores. We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME. To provide a fair comparison, we propose two aggregate scores, one based on the datasets and one on the languages. The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3. This indicates that LEXTREME is still very challenging and leaves ample room for improvement. To make it easy for researchers and practitioners to use, we release LEXTREME on Hugging Face together with all the code required to evaluate models and a public Weights and Biases project with all the runs.
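The abstract does not say how the two aggregate scores are computed. As a rough illustration only, the sketch below assumes that each score is a harmonic mean of per-dataset or per-language averages. The choice of harmonic mean, the dataset names, the language codes, and all numeric values are assumptions made for this example and are not taken from the text above.

```python
from statistics import harmonic_mean

# Hypothetical per-dataset, per-language scores (e.g. macro-F1 in [0, 1]).
# Names and values are made up purely for illustration.
scores = {
    "dataset_a": {"de": 0.70, "fr": 0.60},
    "dataset_b": {"de": 0.50, "it": 0.40},
}


def dataset_aggregate(scores):
    """Average within each dataset first, then across datasets.

    Assumption: the harmonic mean is used at both levels.
    """
    per_dataset = [harmonic_mean(list(by_lang.values()))
                   for by_lang in scores.values()]
    return harmonic_mean(per_dataset)


def language_aggregate(scores):
    """Average within each language first, then across languages.

    Assumption: the harmonic mean is used at both levels.
    """
    per_language = {}
    for by_lang in scores.values():
        for lang, value in by_lang.items():
            per_language.setdefault(lang, []).append(value)
    return harmonic_mean([harmonic_mean(values)
                          for values in per_language.values()])


if __name__ == "__main__":
    print(f"dataset aggregate:  {dataset_aggregate(scores):.4f}")
    print(f"language aggregate: {language_aggregate(scores):.4f}")
```

Under this assumption, the two scores differ whenever languages are unevenly spread across datasets, which is one reason to report both: the first weights every dataset equally, the second weights every language equally.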