Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance gap persists in multilingual contexts, a gap largely attributed to deficiencies in language understanding. To bridge it, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon Group Relative Policy Optimization (GRPO). TAPO enforces an explicit alignment strategy in which the model uses English as a pivot language and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing translation quality rewards to be integrated without introducing optimization conflicts. Extensive experiments show that TAPO effectively combines language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods on both multilingual mathematical reasoning and translation tasks, and it generalizes well to unseen languages and out-of-domain tasks.
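To make the step-level relative advantage idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each sampled response in a group receives two scalar rewards, one for the understanding (translation) step and one for the reasoning step, and that each reward is normalized within the group in the usual GRPO style before being assigned only to the tokens of its own step. The function names, the use of the population standard deviation, and the `eps` constant are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_level_advantages(translation_rewards, reasoning_rewards):
    """Normalize each step's reward independently across the group.

    Returns one (understanding_advantage, reasoning_advantage) pair per
    sampled response. Keeping the two normalizations separate is the
    hypothetical "decoupling": a poor translation score does not lower
    the advantage credited to the reasoning tokens, and vice versa.
    """
    if len(translation_rewards) != len(reasoning_rewards):
        raise ValueError("both reward lists must cover the same group")
    return list(zip(group_relative(translation_rewards),
                    group_relative(reasoning_rewards)))


def token_advantages(n_understand, n_reason, adv_understand, adv_reason):
    """Broadcast each step's advantage over that step's tokens only."""
    return [adv_understand] * n_understand + [adv_reason] * n_reason


if __name__ == "__main__":
    # Hypothetical group of four sampled responses to one question.
    translation_quality = [0.9, 0.4, 0.7, 0.6]  # e.g. a translation metric
    answer_correct = [1.0, 0.0, 0.0, 1.0]       # e.g. final-answer accuracy
    pairs = step_level_advantages(translation_quality, answer_correct)
    for adv_u, adv_r in pairs:
        print(f"understanding {adv_u:+.3f}   reasoning {adv_r:+.3f}")
```

Under these assumptions, a response with a good translation but a wrong final answer still receives a positive advantage on its translation tokens and a negative one on its reasoning tokens, which is one way a translation reward could be added without competing with the reasoning reward inside a single scalar.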