The performance of language models (LMs) on lower-resource, morphologically rich languages such as Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that Mistral-Nemo-Base-2407 achieves the strongest predictive performance on Unicode text, while Mistral-7B-v0.3 leads on Romanized text. The results also highlight the strong all-around performance of Llama-3.1-8B across both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide a practical guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
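As a point of reference for the evaluation metric mentioned above, the following is a minimal sketch of how perplexity is derived from the probabilities a model assigns to each token in a sequence. The function name and the toy probabilities are illustrative only; they are not taken from the paper's experimental setup.

```python
import math

def perplexity(token_probs):
    """Perplexity is the exponential of the mean negative
    log-probability the model assigns to each token: lower
    values mean the model predicts the text better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token in a
# sequence has a perplexity of 4: it is as uncertain as a
# uniform choice among four alternatives at each step.
print(perplexity([0.25, 0.25, 0.25]))
```

In practice, the per-token probabilities come from the LM's softmax output over its vocabulary, so perplexity is directly comparable only between models that share a tokenizer; comparisons across tokenizers (as between Unicode and Romanized text here) require care.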