Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in applying pre-trained GLM knowledge to diverse applications. Despite their potential, these capabilities have not been adequately characterized in quantitative terms because comprehensive benchmarks are lacking, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models such as BERT and neglect the evaluation of generative language models. Moreover, current benchmarks often overlook generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, and toxicity and bias evaluation to a self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian culture, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.