The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of LLMs in the context of Taiwanese Mandarin. TMLU consists of 37 subjects spanning social science, STEM, humanities, Taiwan-specific content, and other areas, ranging from middle-school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models perform worse than multilingual proprietary ones, and that open-weight models tailored for Taiwanese Mandarin lag behind their Simplified Chinese counterparts. These findings indicate substantial headroom for improvement and underscore TMLU's goal of fostering the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts to the community to promote future research.
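To make the few-shot, chain-of-thought evaluation format concrete, the following is a minimal sketch of how a TMLU-style multiple-choice item might be prompted and scored. The exemplar content, field names, and answer-extraction rule here are illustrative assumptions for exposition, not the released evaluation scripts.

```python
# A minimal sketch of few-shot, chain-of-thought multiple-choice evaluation
# in the style described above. All exemplar questions and the scoring rule
# are illustrative assumptions, not the released TMLU evaluation code.

import re

# Hypothetical few-shot exemplar: a multiple-choice question paired with a
# chain-of-thought-like explanation that ends in a definitive answer letter.
FEW_SHOT_EXEMPLARS = [
    {
        "question": "台灣現行的最高法律規範是下列何者？",
        "choices": {"A": "民法", "B": "刑法", "C": "憲法", "D": "行政程序法"},
        "explanation": "憲法是國家的根本大法，位階高於其他法律。",
        "answer": "C",
    }
]

def format_question(q: dict, with_answer: bool) -> str:
    """Render one question as a prompt block; exemplars include the
    explanation and gold answer, while the test question stops at '解析：'
    so the model continues with its own reasoning."""
    lines = [f"問題：{q['question']}"]
    lines += [f"({k}) {v}" for k, v in q["choices"].items()]
    if with_answer:
        lines.append(f"解析：{q['explanation']}")
        lines.append(f"答案：({q['answer']})")
    else:
        lines.append("解析：")
    return "\n".join(lines)

def build_prompt(test_question: dict) -> str:
    """Concatenate the exemplars and the test question into one prompt."""
    blocks = [format_question(ex, with_answer=True) for ex in FEW_SHOT_EXEMPLARS]
    blocks.append(format_question(test_question, with_answer=False))
    return "\n\n".join(blocks)

def extract_answer(model_output: str) -> str | None:
    """Take the last parenthesized option letter as the model's prediction."""
    matches = re.findall(r"\(([ABCD])\)", model_output)
    return matches[-1] if matches else None

if __name__ == "__main__":
    test_q = {
        "question": "下列哪一個城市是台灣的首都？",
        "choices": {"A": "高雄", "B": "台中", "C": "台南", "D": "台北"},
    }
    print(build_prompt(test_q))
    # In a real run the prompt would be sent to the model under evaluation;
    # here a stubbed completion stands in for the model output.
    fake_completion = "台北是中央政府所在地。答案：(D)"
    print("predicted:", extract_answer(fake_completion))
```

Taking the last option letter in the completion is one common convention when chain-of-thought text precedes the final answer, since intermediate reasoning may mention several options before committing to one.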