Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Existing research predominantly focuses on developing powerful large language models (LLMs) for mathematical reasoning in monolingual settings, with few explorations of preserving efficacy in a multilingual context. To bridge this gap, this paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs. First, using translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, which covers ten distinct languages and thus addresses the scarcity of training data for xMR tasks. Based on the collected dataset, we propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs and exhibit superiority over ChatGPT in few-shot scenarios. In particular, MathOctopus-13B reaches 47.6% accuracy, exceeding ChatGPT's 46.3% on the MGSM test set. Beyond these results, we unearth several pivotal observations and insights from extensive experiments: (1) When extended to the multilingual context, the rejection sampling strategy proves effective for model performance, albeit to a limited degree. (2) Employing parallel corpora for math Supervised Fine-Tuning (SFT) across multiple languages not only significantly enhances model performance multilingually but also elevates monolingual performance. This indicates that crafting multilingual corpora can be regarded as a vital strategy for enhancing model performance in a specific language, especially in mathematical reasoning tasks. For instance, MathOctopus-7B improves on its counterpart trained only on English, from 42.2% to 50.8% on the GSM8K test set.
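Observation (1) refers to rejection sampling, which in its generic form keeps only those sampled reasoning paths whose final answer matches the reference answer. The following is a minimal sketch of that filtering step applied per language; it is not the pipeline used for MathOctopus, which the abstract does not describe. The function names `extract_final_answer` and `rejection_sample`, the "last number in the text" answer heuristic, and the exact-string deduplication are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional


def extract_final_answer(reasoning: str) -> Optional[str]:
    """Return the last number in a reasoning path, or None if there is none.

    Assumption: the final answer is the last number that appears in the text.
    """
    # Drop thousands separators so "1,200" is read as "1200".
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning.replace(",", ""))
    return numbers[-1] if numbers else None


def rejection_sample(
    samples_by_language: Dict[str, List[str]], gold_answer: str
) -> Dict[str, List[str]]:
    """Keep distinct sampled paths whose final answer equals the gold answer."""
    kept: Dict[str, List[str]] = {}
    for language, samples in samples_by_language.items():
        seen = set()
        kept[language] = []
        for path in samples:
            answer = extract_final_answer(path)
            # Reject paths with no answer or a wrong answer.
            if answer is None or float(answer) != float(gold_answer):
                continue
            # Reject exact duplicates of a path already kept.
            if path in seen:
                continue
            seen.add(path)
            kept[language].append(path)
    return kept


# Hypothetical sampled reasoning paths for one problem with gold answer 7.
samples = {
    "en": ["3 + 4 = 7", "3 + 4 = 8", "3 + 4 = 7"],
    "es": ["3 más 4 es 7", "sin respuesta"],
}
accepted = rejection_sample(samples, "7")
```

In this sketch the accepted paths would be added to the fine-tuning set for each language; a real system would likely use a stricter answer parser and a looser notion of duplicate reasoning paths.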
