Current large reasoning models (LRMs) have shown strong capabilities on challenging tasks after reinforcement-learning (RL) based post-training. However, prior work focuses almost exclusively on English reasoning in pursuit of the strongest performance, despite the demonstrated potential of multilingual thinking and the need of global users for thinking traces in their native languages. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking-language selection, using multiple languages to improve both exploration and exploitation during RL. Our results show that the method consistently outperforms English-only training under the same training budget, while achieving high thinking-language compliance for both seen and unseen languages. Analysis shows that, by treating on-policy thinking-language selection as an action during RL, ExpLang effectively extends the RL exploration space through diversified language preferences and improves the RL exploitation outcome by leveraging the advantages of non-English reasoning. The method is orthogonal to most RL algorithms and opens a new perspective on using multilinguality to improve LRMs.
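To make the core idea concrete, the sketch below is a minimal, heavily simplified illustration, not ExpLang's actual implementation: a toy REINFORCE loop in which the choice of thinking language is itself an on-policy action whose selection logits are updated from rollout reward. All names and quantities here (LANGS, rollout_reward, the compliance bonus, the assumed non-English advantage) are hypothetical placeholders standing in for the verifier-based rewards and LLM rollouts of the real pipeline.

```python
# Toy sketch: thinking-language selection as an on-policy RL action.
# Everything below is an illustrative assumption, not ExpLang's method;
# a REINFORCE update over a categorical language policy stands in for
# the full LLM post-training loop.
import math
import random

LANGS = ["en", "zh", "fr", "de"]        # hypothetical candidate thinking languages
logits = {lang: 0.0 for lang in LANGS}  # toy policy parameters over languages

def policy_probs():
    """Softmax over the language-selection logits."""
    z = sum(math.exp(v) for v in logits.values())
    return {lang: math.exp(v) / z for lang, v in logits.items()}

def sample_language(probs):
    """Sample a thinking language on-policy from the current distribution."""
    r, acc = random.random(), 0.0
    for lang, p in probs.items():
        acc += p
        if r <= acc:
            return lang
    return lang  # numerical fallback

def rollout_reward(lang):
    """Stand-in reward. In the full pipeline this would come from verifying
    the model's final answer and checking that the thinking trace is written
    in the selected language (compliance). Here it is simulated noise with a
    small assumed non-English advantage, purely for illustration."""
    base = 0.6 if lang != "en" else 0.5
    return random.gauss(base, 0.1) + 0.1  # task reward + compliance bonus

LR, BETA = 0.1, 0.9  # learning rate and baseline decay (toy values)
baseline = 0.0
for step in range(2000):
    probs = policy_probs()
    lang = sample_language(probs)
    reward = rollout_reward(lang)
    baseline = BETA * baseline + (1 - BETA) * reward  # moving-average baseline
    advantage = reward - baseline
    # REINFORCE gradient of log softmax: (1[l == lang] - p_l) * advantage.
    # In the real pipeline the trace tokens themselves would be optimized by
    # the underlying RL algorithm (e.g. GRPO); only the language choice is
    # modeled here.
    for l in LANGS:
        logits[l] += LR * advantage * ((1.0 if l == lang else 0.0) - probs[l])

print({l: round(p, 3) for l, p in policy_probs().items()})
```

Run over many steps, the language distribution stays diverse early on (exploration) and then concentrates on whichever languages yield higher reward (exploitation), mirroring the extended exploration space and leveraged non-English advantage described above.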