In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college-level and job-level examinations. It spans 23 categories covering Bangladesh's culture and history and Bengali linguistics. We benchmarked 6 proprietary and 3 open-source LLMs on BLUCK, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts still falls short of their performance on mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered on native Bengali culture, history, and linguistics.