Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong's unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form, together with its distinct cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong's linguistic competence and socio-cultural knowledge. HKMMLU comprises 26,698 multiple-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM); Social Sciences; Humanities; and Other. To evaluate the multilingual understanding ability of LLMs, we additionally include 90,550 Mandarin-Cantonese translation tasks. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that even the best-performing model, DeepSeek-V3, struggles to reach an accuracy of 75\%, significantly lower than the corresponding accuracies on MMLU and CMMLU. This performance gap highlights the need to improve LLMs' capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategy, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.