SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for new, more challenging ones to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intensive review and analysis tasks remains an open challenge for language models. Comprehensive domain-specific benchmarks, which require high expertise to develop, are also rare, as are multilingual benchmarks. This scarcity underscores the value of our contribution, considering that most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after in-domain pretraining. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.
