Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Existing research predominantly focuses on developing powerful large language models (LLMs) for mathematical reasoning in monolingual settings, with few explorations of how to preserve efficacy in a multilingual context. To bridge this gap, this paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs. First, using translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, which covers ten distinct languages and thus addresses the scarcity of training data for xMR tasks. Based on the collected dataset, we propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs and exhibit superiority over ChatGPT in few-shot scenarios. In particular, MathOctopus-13B reaches 47.6% accuracy on the MGSM test set, exceeding ChatGPT's 46.3%. Beyond these results, we draw several pivotal observations and insights from extensive experiments: (1) When the rejection sampling strategy is extended to the multilingual context, it improves model performance, although the gains are limited. (2) Employing parallel corpora for math Supervised Fine-Tuning (SFT) across multiple languages not only significantly enhances model performance multilingually but also elevates monolingual performance. This indicates that crafting multilingual corpora can be regarded as a vital strategy for enhancing model performance in a specific language, especially in mathematical reasoning tasks. For instance, MathOctopus-7B reaches 50.8% on the GSM8K test set, compared with 42.2% for its counterpart trained on English.
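The rejection sampling strategy mentioned in observation (1) can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything below is an assumption for illustration: rejection sampling is taken in its common form for math SFT (sample several reasoning paths per question, keep only those whose final answer matches the reference answer, drop exact duplicates), and the names `extract_final_answer`, `rejection_sample`, and the `sampler` callable are hypothetical rather than taken from the paper.

```python
import re
from typing import Callable, Iterable, List, Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the last number in a reasoning path, or None if there is none.

    Assumption: the final answer is the last numeric token in the text.
    Thousands separators such as "1,250" are removed before matching.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def rejection_sample(
    question: str,
    gold_answer: str,
    sampler: Callable[[str, int], Iterable[str]],
    num_samples: int = 4,
) -> List[str]:
    """Keep sampled reasoning paths whose final answer matches the reference.

    `sampler` is a hypothetical stand-in for a model call: it takes a question
    and a sample count and yields candidate reasoning paths, in any language.
    """
    kept: List[str] = []
    seen = set()
    for path in sampler(question, num_samples):
        answer = extract_final_answer(path)
        if answer is None:
            continue  # no numeric answer found, reject
        if float(answer) != float(gold_answer):
            continue  # wrong final answer, reject
        if path in seen:
            continue  # exact duplicate of an accepted path, skip
        seen.add(path)
        kept.append(path)
    return kept
```

Because the filter compares only the final numeric answer, the same function applies unchanged to reasoning paths written in any of the target languages; the accepted paths would then be added to the SFT data for that language.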
