Large-scale multitask benchmarks have driven rapid progress in language modeling, yet most emphasize high-resource languages such as English, leaving Bengali underrepresented. We present BnMMLU, a comprehensive benchmark for measuring massive multitask language understanding in Bengali. BnMMLU spans 41 domains across STEM, the humanities, the social sciences, and general knowledge, and contains 134,375 multiple-choice question-option pairs, making it the most extensive Bengali evaluation suite to date. The dataset preserves mathematical content via MathML and includes BnMMLU-HARD, a compact subset constructed from the questions most frequently missed by top-performing systems, to stress-test difficult cases. We benchmark 24 model variants across 11 LLM families, spanning general-purpose/multilingual open-weights models, Bengali-centric open-weights models, and proprietary models, and covering multiple parameter scales and instruction-tuned settings. We evaluate all models under standardized protocols covering two prompting styles (Direct vs. Chain-of-Thought) and two context regimes (0-shot vs. 5-shot), reporting accuracy consistently across families. Our analysis highlights persistent gaps in reasoning and application skills and indicates sublinear returns to scale across model sizes. We release the dataset and evaluation templates to support rigorous, reproducible assessment of Bengali language understanding and to catalyze progress in multilingual NLP.