While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing fixes based on reinforcement learning (RL) typically add a binary language fidelity reward to the accuracy objective, yet they still incur accuracy trade-offs, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first stage, supervised fine-tuning (SFT), trains on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance relative to all baselines in reasoning accuracy, language fidelity, and token efficiency, with the strongest gains on out-of-domain, lower-resource languages.
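To make the explore-then-exploit idea in the RL stage concrete, the following is a minimal sketch rather than the AdaMame-GRPO implementation. The abstract does not give the functional form of the alignment factor, so the linear schedule, the multiplicative way it scales the accuracy reward, and the names `alignment_weight` and `shaped_reward` are all assumptions made for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows standard GRPO.

```python
from statistics import fmean, pstdev


def alignment_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 0.0 at the start of training, 1.0 at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def shaped_reward(correct: bool, trace_lang: str, query_lang: str, weight: float) -> float:
    """Accuracy reward scaled by an assumed query-conditioned alignment factor.

    ASSUMPTION: the factor is 1 - weight * mismatch. With weight near 0, a
    correct answer earns full reward in any reasoning language (exploration);
    with weight near 1, only traces in the query language keep their reward
    (exploitation).
    """
    accuracy = 1.0 if correct else 0.0
    mismatch = 0.0 if trace_lang == query_lang else 1.0
    return accuracy * (1.0 - weight * mismatch)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    if std == 0.0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four sampled traces for one Japanese query:
    # (answer correct?, language of the reasoning trace)
    group = [(True, "ja"), (True, "en"), (False, "ja"), (True, "en")]
    for step in (0, 500, 1000):
        w = alignment_weight(step, total_steps=1000)
        rewards = [shaped_reward(ok, lang, "ja", w) for ok, lang in group]
        print(step, rewards, group_relative_advantages(rewards))
```

Under these assumptions, a correct English trace for a Japanese query is rewarded as much as a correct Japanese one at step 0, and loses that reward as the weight approaches 1, which is one way to read "explore diverse reasoning languages before exploiting reasoning in the query language".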