NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian

Recent advancements in Generative Language Models (GLMs) have transformed Natural Language Processing (NLP) by showcasing the effectiveness of the "pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM knowledge for diverse applications. Despite their potential, these capabilities lack adequate quantitative characterization due to the absence of comprehensive benchmarks, particularly for low-resource languages. Existing low-resource benchmarks focus on discriminative language models like BERT, neglecting the evaluation of generative language models. Moreover, current benchmarks often overlook measuring generalization performance across multiple tasks, a crucial metric for GLMs. To bridge these gaps, we introduce NLEBench, a comprehensive benchmark tailored for evaluating natural language generation capabilities in Norwegian, a low-resource language. We use Norwegian as a case study to explore whether current GLMs and benchmarks in mainstream languages like English can reveal the unique characteristics of underrepresented languages. NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation. It features two high-quality, human-annotated datasets: an instruction dataset covering traditional Norwegian cultures, idioms, slang, and special expressions, and a document-grounded multi-label dataset for topic classification, question answering, and summarization. This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures. Systematic evaluations on the proposed benchmark suite provide insights into the capabilities and scalability of NorGLMs across various downstream tasks.
