Large language models (LLMs) demonstrate strong mathematical reasoning in English, but it remains unclear whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages such as Sinhala and Tamil. We examine this question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation into English-like representations. Using a taxonomy of six math problem types, ranging from basic arithmetic to complex unit-conflict and optimization problems, we evaluate four prominent LLMs. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset in which each problem is authored natively in all three languages by fluent speakers with mathematical training. Our analysis shows that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks degrade significantly in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence does not imply uniform reasoning capability across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively in every language, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.