Although reasoning ability is considered language-agnostic, existing LLMs reason inconsistently across languages: for example, reasoning in a dominant language such as English is superior to reasoning in other languages because multilingual training data are imbalanced. To enhance reasoning in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO), which aligns the reasoning processes in other languages with those in the dominant language. Specifically, we use an off-the-shelf translation model to measure the consistency between answers in non-dominant and dominant languages, and adopt this consistency as the preference signal for optimization with, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO consistently achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), with improved reasoning consistency across languages.
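To make the preference-optimization idea concrete, the following is a minimal sketch of how consistency scores might be turned into preference pairs and scored with the standard DPO objective. It assumes each sampled non-dominant-language answer already has a scalar consistency score from a translation model; the function names `build_preference_pairs` and `dpo_loss`, the `min_gap` threshold, and the example scores are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def build_preference_pairs(candidates, min_gap=0.0):
    """Turn scored answers into (chosen, rejected) pairs.

    candidates: list of (answer_text, consistency_score) tuples, where the
    score is assumed to come from a translation model (hypothetical input).
    Pairs whose score gap is not larger than min_gap are skipped.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) <= min_gap:
            continue  # no clear preference between these two answers
        if score_a > score_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative usage with made-up scores for three sampled answers.
samples = [("answer A", 0.9), ("answer B", 0.2), ("answer C", 0.2)]
pairs = build_preference_pairs(samples)
```

Here `pairs` keeps only the comparisons with a clear score gap, so the tie between the second and third answers yields no pair. In practice the log-probabilities passed to `dpo_loss` would come from the policy and a frozen reference model; the PPO variant mentioned in the abstract would instead use the score as a reward and is not covered by this sketch.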