Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks focus primarily on English or a narrow subset of high-resource languages, leaving significant gaps in the assessment of multilingual and cross-lingual mathematical reasoning. To address this, we introduce MATHMIST, a parallel multilingual benchmark for mathematical problem solving and reasoning. MATHMIST comprises 2,890 parallel Bangla-English gold-standard artifacts, totaling approximately 30K aligned question-answer pairs across thirteen languages and covering high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution-synthesis capabilities. We systematically evaluate a diverse suite of models, including small and medium-sized open-source LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), perturbed-reasoning, and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist