The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages such as Lao. To fill this gap, we introduce \textbf{LaoBench}, the first large-scale, high-quality, and multidimensional benchmark for assessing the language understanding and reasoning abilities of LLMs in Lao. LaoBench contains \textbf{17,000+} expert-curated samples spanning three dimensions: culturally grounded knowledge application, curriculum-aligned K12 education, and bilingual translation among Lao, Chinese, and English. It comprises open-source and held-out subsets, where the held-out portion enables secure black-box evaluation via a controlled service, improving fairness and data security. We construct LaoBench with a hybrid pipeline that combines expert authoring with agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational validity. We evaluate a diverse set of state-of-the-art open-source and closed-source LLMs and find that even strong multilingual models lag behind human experts, particularly in culturally grounded reasoning and translation fidelity. We hope LaoBench will catalyze research on Lao and other underrepresented Southeast Asian languages, enabling more inclusive multilingual evaluation.