Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven LLMs, spanning frontier proprietary systems and smaller open-source models -- GPT-4o, GPT-4, Claude~3.5~Sonnet, LLaMA~3.1, Mistral~Large~2, LLaMA-2~Chat~13B, and Mistral~7B~Instruct -- on a new cross-lingual benchmark covering \textbf{Cantonese, Japanese, and Turkish}. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine \textbf{human evaluations} (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude~3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust performance even on cross-lingual tasks, and Claude~3.5~Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. The smaller open-source models (LLaMA-2~13B, Mistral~7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results and qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
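To make the automated-metric side of the evaluation concrete, the sketch below implements a minimal sentence-level BLEU (clipped n-gram precision with a brevity penalty). This is an illustration, not the paper's evaluation code: the add-one smoothing for higher-order n-grams is an assumed simplification, and real multilingual evaluation would need language-appropriate tokenization (whitespace splitting, as used here, is inadequate for Cantonese and Japanese, which are typically scored at the character level).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Minimal BLEU: geometric mean of clipped n-gram precisions (n=1..max_n)
    times a brevity penalty. Uses naive whitespace tokenization and add-one
    smoothing for n>1 -- simplifications for illustration only."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        total = sum(hyp_ng.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no unigram match -> BLEU is zero
            p = overlap / total
        else:
            p = (overlap + 1) / (total + 1)  # add-one smoothing (assumption)
        log_precisions.append(math.log(p))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

In practice one would use a standard implementation such as sacreBLEU for comparable scores; the point here is only the shape of the metric: precision rewards matched n-grams, while the brevity penalty keeps a model from gaming precision with very short outputs.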