Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Existing research predominantly focuses on developing powerful large language models (LLMs) for mathematical reasoning in monolingual settings, with little exploration of how to preserve their efficacy in a multilingual context. To bridge this gap, this paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs. First, using translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, which covers ten distinct languages and thus addresses the scarcity of training data for xMR tasks. Based on the collected dataset, we propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs and exhibit superiority over ChatGPT in few-shot scenarios. In particular, MathOctopus-13B reaches 47.6% accuracy on the MGSM test set, exceeding ChatGPT's 46.3%. Beyond these results, we unearth several pivotal observations and insights from extensive experiments: (1) When the rejection sampling strategy is extended to the multilingual context, it proves effective for model performance, albeit to a limited degree. (2) Employing parallel corpora for math Supervised Fine-Tuning (SFT) across multiple languages not only significantly enhances model performance multilingually but also elevates monolingual performance. This indicates that crafting multilingual corpora can be regarded as a vital strategy for enhancing model performance in a specific language, especially in mathematical reasoning tasks. For instance, MathOctopus-7B reaches 50.8% on the GSM8K test set, up from 42.2% for its counterpart trained on English.
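
The abstract mentions extending rejection sampling to the multilingual context without describing the procedure. As a rough illustration only, the sketch below shows rejection sampling as it is commonly used to augment math fine-tuning data: sample several candidate solutions per problem, keep those whose final answer matches the reference, and drop duplicates, applied here per language. The function names (`extract_final_answer`, `rejection_sample`, `augment_by_language`), the "last number in the text" answer heuristic, and the data layout are all assumptions made for this example, not the authors' implementation.

```python
import re


def extract_final_answer(solution: str):
    """Return the last number in a solution string as a float, or None.

    Assumption for this sketch: the final answer is the last number that
    appears in the text. Thousands separators such as "1,234" are removed.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def rejection_sample(candidates, gold_answer, tol=1e-6):
    """Keep distinct candidate solutions whose final answer matches gold."""
    kept, seen = [], set()
    for text in candidates:
        answer = extract_final_answer(text)
        if answer is None or abs(answer - gold_answer) > tol:
            continue  # reject: no answer found, or the answer is wrong
        key = " ".join(text.split())  # whitespace-normalised form for dedup
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def augment_by_language(samples, gold_answer):
    """Apply the same filter to sampled solutions in each language.

    `samples` maps a language code to a list of sampled solution strings
    for one problem (a hypothetical layout chosen for this sketch).
    """
    return {
        lang: rejection_sample(texts, gold_answer)
        for lang, texts in samples.items()
    }
```

Under these assumptions, calling `augment_by_language` with a dictionary of sampled solutions per language and the problem's reference answer returns only the correct, distinct reasoning paths for each language, which could then be added to that language's fine-tuning data.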
