Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Existing research predominantly focuses on developing powerful large language models (LLMs) for mathematical reasoning in monolingual settings, with few explorations of how to preserve efficacy in a multilingual context. To bridge this gap, this paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs. First, by utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages, thus addressing the scarcity of training data in xMR tasks. Based on the collected dataset, we propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs and exhibit superiority over ChatGPT in few-shot scenarios. In particular, MathOctopus-13B reaches 47.6% accuracy on the MGSM test set, exceeding ChatGPT's 46.3%. Beyond these results, we draw several pivotal observations and insights from extensive experiments: (1) When the rejection sampling strategy is extended to the multilingual context, it improves model performance, although the gains are limited. (2) Employing parallel corpora for math Supervised Fine-Tuning (SFT) across multiple languages not only significantly enhances model performance multilingually but also elevates monolingual performance. This indicates that crafting multilingual corpora can be regarded as a vital strategy for enhancing model performance in a specific language, especially in mathematical reasoning tasks. For instance, MathOctopus-7B improves on its counterpart trained only on English data, from 42.2% to 50.8% on the GSM8K test set.
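As an illustration of observation (1), the following is a minimal sketch of what rejection sampling over multilingual candidates might look like: sampled reasoning paths are kept only if their final numeric answer matches the gold answer, and duplicates are dropped per question and language. It is not the paper's implementation. The function names, the tuple layout of `candidates`, and the "last number in the text is the answer" heuristic are all assumptions made for this example, and the heuristic treats commas as thousands separators, which would misread decimal commas used in some languages.

```python
import re
from collections import defaultdict

# Assumption: the final answer is the last number that appears in the solution.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def final_answer(solution):
    """Return the last number in a solution as a float, or None if there is none."""
    # Assumption: commas are thousands separators ("3,000" -> "3000").
    matches = NUMBER.findall(solution.replace(",", ""))
    return float(matches[-1]) if matches else None


def rejection_sample(candidates, gold):
    """Keep sampled solutions whose final answer matches the gold answer.

    candidates: iterable of (question_id, language, solution_text) tuples
                (hypothetical layout chosen for this sketch).
    gold:       dict mapping question_id to the correct numeric answer.
    Returns a dict mapping (question_id, language) to the accepted solutions.
    """
    kept = defaultdict(list)
    seen = set()
    for qid, lang, text in candidates:
        answer = final_answer(text)
        # Reject paths with no answer or with a wrong answer.
        if answer is None or abs(answer - gold[qid]) > 1e-6:
            continue
        # Drop duplicates that differ only in whitespace.
        key = (qid, lang, " ".join(text.split()))
        if key in seen:
            continue
        seen.add(key)
        kept[(qid, lang)].append(text)
    return dict(kept)
```

The accepted paths could then be merged with the original training data before fine-tuning; how many samples to draw per question and language is left open here.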
